Exponential Bounds for Convergence of Entropy Rate
Approximations and Rate of Memory Loss in Hidden Markov
Models Satisfying a Path-Mergeability Condition



Nicholas F. Travers 



Abstract 

A hidden Markov model (HMM) is said to have path-mergeable states if for any two states i,j there
exists a word w and state k such that it is possible to transition from both i and j to k while emitting 
w. We show that for a finite HMM with path-mergeable states the block estimates of the entropy rate 
converge exponentially fast, and also that the initial state is almost surely forgotten at an exponential 
rate. 



1 Introduction 

Hidden Markov models (HMMs) are generalizations of Markov chains in which the underlying Markov state
sequence $(S_t)$ is observed through a noisy or lossy channel, leading to a (typically) non-Markovian output
process $(X_t)$. They were first introduced in the 1950s as abstract mathematical models [1]-[3], but have since
been applied quite successfully in a number of modeling contexts, such as speech recognition [4HZ] and
bioinformatics [SHE].

One of the earliest major questions in the study of HMMs [3] was to determine the entropy rate of the
output process:

\[ h = \lim_{n \to \infty} H(X_n \mid X_1, \dots, X_{n-1}) . \]

Somewhat surprisingly perhaps, in comparison with the case of Markov chains, this turns out to be quite 
difficult. Even for finite HMMs no general closed form expression is known, and it is widely believed that 
no such formula exists. A nice integral expression was provided in [3], but it is with respect to an invariant 
density that is not directly computable. 

In practice, the entropy rate h is instead often estimated simply by the finite-block approximations:

\[ h(n) = H(X_n \mid X_1, \dots, X_{n-1}) \]

as in [13]. Thus, it is important to know the rate of convergence in order to ensure the quality of these
estimates.

No general bounds are known. However, in reference [14] a good exponential upper bound on the rate of
convergence is shown for finite, functional HMMs with strictly positive transition probabilities. In [15, 16] we
have also demonstrated (by quite different methods) exponential convergence for finite, unifilar, edge-emitting 
HMMs. Here we prove exponential convergence for finite HMMs (both state-emitting and edge-emitting) 
satisfying the following simple path-mergeability property: For each pair of distinct states i,j there exists a 
word w and state k such that it is possible to transition from both i and j to k while emitting w. 

Additionally, we show that a finite HMM satisfying this path-mergeability property will a.s. forget its
initial condition, and do so at an exponential rate. Similar questions on the rate of memory loss in
state-emitting HMMs have also been studied by several other authors, for instance [17]-[25], but primarily in the
case where the observed process $(X_t)$, and sometimes also the underlying state space, is continuous, often
with specifically Gaussian noise in the observations. So, the questions are intuitively quite similar,
but the technicalities tend to differ considerably. We will comment more on these relations in the
discussion in Section 6.

Our general method of proof (both for establishing bounds on the convergence of the entropy rate
estimates and rate of memory loss) is closely related to the original coupling argument used in [14], but
is somewhat more involved because our path-mergeability assumption is weaker than the strict positivity
of state transitions assumed in that work. The main additional step is to apply large deviation estimates
to a reverse-time generation process to show that there exists a set of "good" length-(t + 1) sequences $G_t$
of combined probability 1 - O(exponentially small), for which a coupling method like that in [14] may be
applied.

The structure of the paper is as follows. In Section [2] we introduce the formal framework for our results, 
including more complete definitions for hidden Markov models and their various properties, as well as the 
entropy rate and its finite-block estimates. In Section 3 we define two important auxiliary probability spaces
that will be necessary for our proofs. In Section 4 we provide proofs of our exponential convergence results
for edge-emitting HMMs satisfying the path-mergeability property. In Section 5 we use the edge-emitting
results to establish analogous results for state-emitting HMMs. Finally, in Section 6 we discuss relations to
previous work and conditions assumed by other authors in more detail, as well as some possible extensions 
of the current results. 

2 Definitions and Notation 

2.1 The Entropy Rate and Finite-Block Estimates 

Definition 1. For a discrete random variable X with probability mass function p(x), the entropy H(X) is:

\[ H(X) = -\sum_x p(x) \log_2 p(x) \]

Definition 2. For discrete random variables X and Y with joint probability mass function p(x,y), the
conditional entropy H(X|Y) is:

\[ H(X|Y) = \sum_y p(y) \cdot H(X \mid Y = y) = -\sum_y p(y) \sum_x p(x|y) \log_2 p(x|y) \]

In these definitions, and throughout this paper, we adopt the standard information theoretic convention
$0 \cdot \log(0) = 0$, obtained by extending the function $\xi \log(\xi)$ to the point $\xi = 0$ by continuity. Intuitively,
the entropy H(X) is the amount of uncertainty in predicting X, or equivalently, the amount of information
obtained by observing X. The conditional entropy H(X|Y) is the average uncertainty in predicting X given
the observation of Y. These quantities satisfy the relations:

\[ 0 \leq H(X|Y) \leq H(X) \]
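As a small numerical illustration of Definitions 1 and 2, the following sketch computes H(X) and H(X|Y) for a joint distribution and checks the relations above. The joint pmf `joint` and the helper names are hypothetical choices made for this example only.

```python
import math

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    # Terms with p == 0 are skipped, matching the convention 0 * log(0) = 0.
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal_x(joint):
    """Marginal pmf of X from a joint pmf given as {(x, y): probability}."""
    p_x = {}
    for (x, _), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return p_x

def conditional_entropy(joint):
    """H(X|Y) in bits from a joint pmf given as {(x, y): probability}."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    h = 0.0
    for (_, y), p in joint.items():
        if p > 0:
            # p(x, y) * log2 p(x|y), with p(x|y) = p(x, y) / p(y)
            h -= p * math.log2(p / p_y[y])
    return h

# Hypothetical joint pmf: X agrees with Y with probability 0.8.
joint = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.4}
h_x = entropy(marginal_x(joint))
h_x_given_y = conditional_entropy(joint)
```

Here X is uniform on two outcomes, so H(X) is one bit, while observing Y reduces the uncertainty to the binary entropy of 0.2.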

For a stationary process $(X_t)$, the entropy rate is simply the asymptotic per-symbol entropy.

Definition 3. Let $(X_t)$ be a discrete time stationary stochastic process over a finite alphabet $\mathcal{X}$. The entropy
rate h is:

\[ h = \lim_{n \to \infty} H(X_1^n)/n , \]

where $X_1^n = X_1, \dots, X_n$ is interpreted as a single discrete random variable taking values in the cross product
alphabet $\mathcal{X}^n$.
Using stationarity it may be shown that this limit h always exists and is approached monotonically from
above. Further, it may be shown that the entropy rate may also be expressed as the monotonic limit of the
conditional next-symbol entropies h(n):

\[ h(n) \searrow h , \]

where

\[ h(n) = H(X_n \mid X_1^{n-1}) . \]

The block estimates $H(X_1^n)/n$ can approach h no faster than a rate of 1/n. However, the conditional block
estimates $h(n) = H(X_n \mid X_1^{n-1})$ can approach much more quickly, and are therefore generally more useful.
One of our primary goals is to establish an exponential bound on the rate of approach for a suitable class of
HMMs (defined below).

2.2 Hidden Markov Models 

We will consider here only finite HMMs, meaning that both the internal state set $\mathcal{S}$ and output alphabet $\mathcal{X}$
are finite. There are two primary types: state-emitting and edge-emitting. The state-emitting variety is the
simpler of the two, and also the more commonly studied, so we introduce it first. However, our primary
focus will be on edge-emitting HMMs because the path-mergeability condition we study, as well as the block
presentation of Section 4.2.1 used in the proofs, are both more natural in this context.

Definition 4. A state-emitting hidden Markov model is a 4-tuple $(\mathcal{S}, \mathcal{X}, T, O)$ where:

• $\mathcal{S}$ is a finite set of states.

• $\mathcal{X}$ is a finite alphabet of output symbols.

• T is an $|\mathcal{S}| \times |\mathcal{S}|$ stochastic state transition matrix: $T_{ij} = \mathbb{P}(S_{t+1} = j \mid S_t = i)$.

• O is an $|\mathcal{S}| \times |\mathcal{X}|$ stochastic observation matrix: $O_{ix} = \mathbb{P}(X_t = x \mid S_t = i)$.

The state sequence $(S_t)$ for a state-emitting HMM is generated according to the Markov kernel T, and
the observed sequence $(X_t)$ has conditional distribution defined by the observation matrix O:

\[ \mathbb{P}(X_n^m = x_n^m \mid S_0^\infty = s_0^\infty) = \mathbb{P}(X_n^m = x_n^m \mid S_n^m = s_n^m) = \prod_{t=n}^{m} O_{s_t x_t} , \]

where we denote $X_n^m = X_n X_{n+1} \dots X_m$ and $S_n^m = S_n S_{n+1} \dots S_m$ for integers $n \leq m$, and extend in the natural
way to the case $m = \infty$.

An important special case is when the observation matrix is deterministic, and the symbol $X_t$ is simply
a function of the state $S_t$. HMMs of this type, known as functional HMMs or functions of Markov chains,
are perhaps the simplest variety conceptually, and were also the first type to be heavily studied. The
integral expression for the entropy rate provided in [3] and the exponential bound on the convergence of the
estimates h(n) established in [14] both dealt with HMMs of this type.

Definition 5. A functional hidden Markov model is a state-emitting hidden Markov model for which the
observation matrix O is deterministic: $O_{ix} = 1$ if $x = f(i)$ and $O_{ix} = 0$ otherwise, for some function
$f : \mathcal{S} \to \mathcal{X}$. It is canonically represented as a 4-tuple $(\mathcal{S}, \mathcal{X}, T, f)$.

Edge-emitting HMMs are an alternative representation in which the symbol $X_t$ depends not simply on
the current state $S_t$ but also the next state $S_{t+1}$, or rather the transition between them.

Definition 6. An edge-emitting hidden Markov model is a 3-tuple $(\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ where:

• $\mathcal{S}$ is a finite set of states.

• $\mathcal{X}$ is a finite alphabet of output symbols.

• $T^{(x)}$, $x \in \mathcal{X}$, are $|\mathcal{S}| \times |\mathcal{S}|$ sub-stochastic symbol-labeled transition matrices whose sum T is stochastic.
$T^{(x)}_{ij}$ is the probability of transitioning from i to j on symbol x.

Visually, one can depict an edge-emitting HMM as a directed graph with labeled edges. The vertices are
the states, and for each i, j, x with $T^{(x)}_{ij} > 0$ there is a directed edge from i to j labeled with the transition
probability $p = T^{(x)}_{ij}$ and symbol x. The sum of the probabilities on all outgoing edges from each state is 1.

The operation of the HMM is as follows: From the current state $S_t$ the HMM picks an outgoing edge $E_t$
according to their probabilities, generates the symbol $X_t$ labeling this edge, and then follows the edge to the
next state $S_{t+1}$. Thus we have the conditional measure:

\[ \mathbb{P}(S_{t+1} = j, X_t = x \mid S_t = i, S_0^{t-1} = s_0^{t-1}, X_0^{t-1} = x_0^{t-1}) = \mathbb{P}(S_{t+1} = j, X_t = x \mid S_t = i) = T^{(x)}_{ij} \]

for any $i \in \mathcal{S}$, $t \geq 0$, and possible length-t joint past $(s_0^{t-1}, x_0^{t-1})$ which may precede state i. From this it
follows, of course, that the state sequence $(S_t)$ is indeed a Markov chain with transition kernel $T = \sum_x T^{(x)}$.
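The operation just described can be sketched in a few lines. The two-state, two-symbol matrices `T` below are a hypothetical example, and the function names are illustrative only; `T[x][i][j]` plays the role of the entry (i, j) of the symbol-labeled matrix for symbol x.

```python
import random

# Hypothetical example: T[x][i][j] = P(S_{t+1} = j, X_t = x | S_t = i).
T = {
    0: [[0.5, 0.0], [0.2, 0.0]],
    1: [[0.2, 0.3], [0.3, 0.5]],
}
N_STATES = 2

def is_valid(T, n, tol=1e-12):
    """Entries are nonnegative and the summed matrix over symbols is row-stochastic."""
    nonneg = all(T[x][i][j] >= 0 for x in T for i in range(n) for j in range(n))
    rows_ok = all(
        abs(sum(T[x][i][j] for x in T for j in range(n)) - 1.0) <= tol
        for i in range(n)
    )
    return nonneg and rows_ok

def step(T, n, i, rng):
    """From state i, pick an outgoing edge by its probability; return (symbol, next state)."""
    edges = [(x, j) for x in T for j in range(n)]
    weights = [T[x][i][j] for (x, j) in edges]
    return rng.choices(edges, weights=weights, k=1)[0]

def generate(T, n, s0, length, rng):
    """Run the HMM for `length` steps from s0; return (symbols, state sequence)."""
    symbols, states = [], [s0]
    for _ in range(length):
        x, j = step(T, n, states[-1], rng)
        symbols.append(x)
        states.append(j)
    return symbols, states

def word_prob_from(T, n, i, word):
    """Probability of emitting `word` from initial state i: e_i T^(w_0) ... T^(w_{m-1}) 1."""
    v = [1.0 if s == i else 0.0 for s in range(n)]
    for x in word:
        v = [sum(v[a] * T[x][a][b] for a in range(n)) for b in range(n)]
    return sum(v)

symbols, states = generate(T, N_STATES, 0, 10, random.Random(0))
```

Summing `word_prob_from` over all words of a fixed length gives 1 for every initial state, which is one quick sanity check that the matrices define a valid model.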

Remark. It is implicitly assumed (for a HMM of any type) that each symbol $x \in \mathcal{X}$ may actually be
generated with positive probability. That is, for each $x \in \mathcal{X}$, there exists $i \in \mathcal{S}$ such that $\mathbb{P}(X_0 = x \mid S_0 = i) > 0$.
Otherwise, the symbol x is useless and the alphabet can be restricted to $\mathcal{X} \setminus \{x\}$. It is also assumed, throughout,
that the state set $\mathcal{S}$ and output alphabet $\mathcal{X}$ both have size at least two. Otherwise, the conclusions are
essentially trivial in all cases, but some of the proofs and definitions may be ambiguous as they are written.

2.2.1 Irreducibility and Stationary Measures 

A HMM, either state-emitting or edge-emitting, is said to be irreducible if the underlying Markov chain
over states with transition kernel T is irreducible. In this case, there exists a unique stationary distribution
$\pi$ over the states satisfying $\pi = \pi T$, and the joint state-symbol sequence $(S_t, X_t)_{t \geq 0}$ with initial state $S_0$
drawn according to $\pi$ is itself a stationary process. We will henceforth assume all HMMs are irreducible,
and denote by $\mathbb{P}$ the (unique) stationary measure on joint state-symbol sequences satisfying $S_0 \sim \pi$.
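For a concrete edge-emitting HMM the stationary distribution and stationary word probabilities can be computed numerically. The sketch below is one possible approach, not taken from the text: it iterates the lazy chain (T + I)/2, which has the same stationary distribution as T and converges for any finite irreducible T, including periodic ones. The example matrices and all function names are hypothetical.

```python
def state_transition_matrix(T, n):
    """Sum the symbol-labeled matrices to get the state transition kernel."""
    return [[sum(T[x][i][j] for x in T) for j in range(n)] for i in range(n)]

def stationary_distribution(P, iters=2000):
    """Approximate pi with pi = pi P by iterating the lazy chain (P + I)/2."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        moved = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        pi = [(a + b) / 2.0 for a, b in zip(pi, moved)]
    return pi

def stationary_word_prob(T, pi, word):
    """Stationary probability of a word: pi T^(w_0) ... T^(w_{m-1}) 1."""
    n = len(pi)
    v = list(pi)
    for x in word:
        v = [sum(v[a] * T[x][a][b] for a in range(n)) for b in range(n)]
    return sum(v)

# Hypothetical example: T[x][i][j] = P(S_{t+1} = j, X_t = x | S_t = i).
T = {0: [[0.5, 0.0], [0.0, 0.0]], 1: [[0.0, 0.5], [1.0, 0.0]]}
P = state_transition_matrix(T, 2)
pi = stationary_distribution(P)
```

For this example the stationary distribution is (2/3, 1/3). Stationarity of the output process can be spot-checked by confirming that prepending an arbitrary symbol and summing over it leaves a word's probability unchanged.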

This measure $\mathbb{P}$ will be our primary focus. However at times, in particular for studying the rate of
memory loss in the initial condition, we will also consider the situation in which the initial state $S_0$ is chosen
according to some alternative distribution. We denote by $\mathbb{P}_i$ the measure on the joint sequences $(S_t, X_t)_{t \geq 0}$
given by fixing $S_0 = i$, and by $\mathbb{P}_\mu$ the measure given by choosing $S_0$ according to the distribution $\mu$:

\[ \mathbb{P}_i(\cdot) = \mathbb{P}(\cdot \mid S_0 = i) \quad and \quad \mathbb{P}_\mu(\cdot) = \sum_i \mu_i \mathbb{P}_i(\cdot) \]

These measures $\mathbb{P}$, $\mathbb{P}_i$, and $\mathbb{P}_\mu$ are, of course, also extendable in a natural way to bi-infinite sequences
$(S_t, X_t)_{t \in \mathbb{Z}}$ as opposed to one-sided sequences $(S_t, X_t)_{t \geq 0}$, and we will do so as necessary.

2.2.2 Equivalence of Model Types 

Though they are indeed different objects, state-emitting, edge-emitting, and functional HMMs are all equivalent
in the following sense: Given an irreducible HMM M of any of these types there exists an irreducible
HMM M' of any of the other types such that the stationary output processes $(X_t)$ for the two HMMs M and
M' are equal in distribution. The equivalence is also constructive in that M' can always be constructed from
M. We recall below the standard conversions.

1. Functional to State-Emitting - Since a functional HMM is a state-emitting HMM (with deterministic
observation matrix) no conversion is necessary.

2. State-Emitting to Edge-Emitting - If $M = (\mathcal{S}, \mathcal{X}, T, O)$ then $M' = (\mathcal{S}, \mathcal{X}, \{T'^{(x)}\})$, where $T'^{(x)}_{ij} = O_{ix} T_{ij}$.

3. Edge-Emitting to Functional - If $M = (\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ then $M' = (\mathcal{S}', \mathcal{X}, T', f)$, where $\mathcal{S}' = \{(i,x) : \sum_j T^{(x)}_{ij} > 0\}$,
$T'_{(i,x),(j,y)} = (T^{(x)}_{ij} / \sum_{j'} T^{(x)}_{ij'}) \cdot (\sum_{j'} T^{(y)}_{jj'})$, and $f(i,x) = x$.

By composition of these three conversion algorithms one may convert from any of the HMM varieties to
any of the other varieties. This equivalence of model types, and specifically the conversion algorithms, will
be useful when considering extensions of our results for edge-emitting HMMs to state-emitting HMMs in
Section 5.
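The state-emitting to edge-emitting conversion can be checked numerically. The sketch below assumes the conversion takes the form of labeling each transition i to j with the symbol emitted from i, so that the edge-emitting matrix for symbol x has entries O[i][x] * T[i][j]; this form, the example matrices, and the function names are assumptions made for illustration. The check compares word probabilities from a brute-force sum over state paths with the matrix-product formula for the converted model.

```python
from itertools import product

def state_to_edge_emitting(T, O):
    """Assumed conversion: entry (i, j) of the matrix for symbol x is O[i][x] * T[i][j]."""
    n, m = len(T), len(O[0])
    return {
        x: [[O[i][x] * T[i][j] for j in range(n)] for i in range(n)]
        for x in range(m)
    }

def word_prob_state_emitting(mu, T, O, word):
    """Brute force over state paths; the t-th symbol is emitted from the t-th state."""
    n = len(T)
    total = 0.0
    for path in product(range(n), repeat=len(word)):
        p = mu[path[0]] * O[path[0]][word[0]]
        for t in range(1, len(word)):
            p *= T[path[t - 1]][path[t]] * O[path[t]][word[t]]
        total += p
    return total

def word_prob_edge_emitting(mu, Tx, word):
    """Matrix-product formula mu T^(w_0) ... T^(w_{m-1}) 1 for an edge-emitting HMM."""
    n = len(mu)
    v = list(mu)
    for x in word:
        v = [sum(v[a] * Tx[x][a][b] for a in range(n)) for b in range(n)]
    return sum(v)

# Hypothetical state-emitting HMM with stationary distribution (0.8, 0.2).
T = [[0.9, 0.1], [0.4, 0.6]]
O = [[0.8, 0.2], [0.3, 0.7]]
mu = [0.8, 0.2]
Tx = state_to_edge_emitting(T, O)
```

The two word-probability functions agree on every nonempty word for any initial distribution `mu`, so in particular the stationary output processes coincide.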

2.2.3 Path-Mergeability 

For a HMM M, let $\delta_i(w)$ be the set of states j to which state i can transition upon emitting the word w:

$\delta_i(w) = \{ j \in \mathcal{S} : \mathbb{P}_i(X_0^{|w|-1} = w, S_{|w|} = j) > 0 \}$ , for an edge-emitting HMM
$\delta_i(w) = \{ j \in \mathcal{S} : \mathbb{P}_i(X_1^{|w|} = w, S_{|w|} = j) > 0 \}$ , for a state-emitting HMM

where |w| denotes the length of w. In either case, if w is the null word $\lambda$ then $\delta_i(w) = \{i\}$, for each i. The
following two properties will be of central interest.

Definition 7. A HMM is said to have path-mergeable states (or be path-mergeable) if for each pair of
states i, j there exists some word w and state k such that it is possible to transition from both i and j to k
on w: $k \in \delta_i(w)$ and $k \in \delta_j(w)$.

Definition 8. A HMM is said to be state-collapsible if for each state k there exists some symbol x, such
that if the symbol x is observed then it is possible to collapse to state k at the next time step, regardless of
the previous state distribution: $k \in \delta_i(x)$, for all i with $\delta_i(x) \neq \emptyset$.
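Both properties are decidable for a finite edge-emitting HMM. The following sketch is one way to test them: path-mergeability of a pair (i, j) is checked by breadth-first search over pairs of states that can be reached simultaneously on a common word, and state-collapsibility is checked symbol by symbol. The two example machines and all names are hypothetical.

```python
from collections import deque

def delta(T, n, i, word):
    """States reachable from i while emitting `word` (edge-emitting HMM)."""
    current = {i}
    for x in word:
        current = {j for a in current for j in range(n) if T[x][a][j] > 0}
    return current

def mergeable_pair(T, n, i, j):
    """True if some word w and state k satisfy k in delta_i(w) and k in delta_j(w)."""
    seen, queue = {(i, j)}, deque([(i, j)])
    while queue:
        a, b = queue.popleft()
        if a == b:
            return True
        for x in T:
            for a2 in range(n):
                if T[x][a][a2] <= 0:
                    continue
                for b2 in range(n):
                    if T[x][b][b2] > 0 and (a2, b2) not in seen:
                        seen.add((a2, b2))
                        queue.append((a2, b2))
    return False

def is_path_mergeable(T, n):
    return all(mergeable_pair(T, n, i, j) for i in range(n) for j in range(i + 1, n))

def is_state_collapsible(T, n):
    """Each state k needs a symbol x with k in delta_i(x) for every i that can emit x."""
    for k in range(n):
        found = False
        for x in T:
            emitters = [i for i in range(n) if delta(T, n, i, [x])]
            if emitters and all(k in delta(T, n, i, [x]) for i in emitters):
                found = True
                break
        if not found:
            return False
    return True

# Hypothetical examples. In SWAP, symbol "a" sends both states to state 0.
SWAP = {"a": [[0.5, 0.0], [0.5, 0.0]], "b": [[0.0, 0.5], [0.5, 0.0]]}
# In NO_MERGE, the two states never reach a common state on a common word.
NO_MERGE = {0: [[0.5, 0.0], [0.0, 0.0]], 1: [[0.0, 0.5], [1.0, 0.0]]}
```

The machine `SWAP` is path-mergeable but not state-collapsible, since no single symbol collapses to state 1. The two-symbol word "ab" does lead from both states to state 1, which illustrates why passing to blocks of symbols can recover the state-collapsible property.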

Our end goal in Section 4 below is to prove exponential bounds on convergence of the entropy rate
estimates h(n) and rate of memory loss in edge-emitting HMMs with path-mergeable states. To do so,
however, we will first prove similar bounds for edge-emitting HMMs under the state-collapsible
hypothesis, and then bootstrap. As we show in Section 4.2.1, if an edge-emitting HMM has path-mergeable
states then some "power of it" is state-collapsible. Thus, exponential convergence bounds for state-collapsible
(edge-emitting) HMMs pass to exponential bounds for path-mergeable (edge-emitting) HMMs by considering
block presentations. In Section 5 we will also consider similar questions for state-emitting HMMs. In this
case, analogous convergence results follow easily from the results for edge-emitting HMMs by applying the
standard state-emitting to edge-emitting conversion.

2.2.4 Loss of Memory 

Let $\phi_i(w)$ denote the probability distribution over the current state given that the initial state $S_0$ is state i,
and the first |w| symbols of output for the HMM are, in fact, the word w:

$\phi_i(w) = \mathbb{P}_i(S_{|w|} \mid X_0^{|w|-1} = w)$ , for an edge-emitting HMM
$\phi_i(w) = \mathbb{P}_i(S_{|w|} \mid X_1^{|w|} = w)$ , for a state-emitting HMM

Also denote, analogously, $\phi_\mu(w)$ as the conditional distribution over the current state given that the initial
state $S_0$ is chosen according to the distribution $\mu$, and the first |w| output symbols are the word w. And,
let $\phi(w) = \phi_\pi(w)$. In the case that the word w cannot be generated from the initial state $S_0 = i$ (or
distribution $\mu$), we take, by convention, $\phi_i(w)$ (or $\phi_\mu(w)$) to be the non-distribution consisting of all 0s.
An important property of HMMs is the so-called forgetting of the initial condition or loss of memory.
Definition 9. An edge-emitting HMM is said to a.s. forget its initial condition if for each initial state i:

\[ \| \phi_i(X_0^{t-1}) - \phi(X_0^{t-1}) \|_{TV} \to 0 , \quad \mathbb{P}_i \ a.s. \]

where $\|\mu - \nu\|_{TV} = \frac{1}{2} \|\mu - \nu\|_1$ is the total variational norm of two finite probability distributions $\mu =
(\mu_1, \dots, \mu_n)$ and $\nu = (\nu_1, \dots, \nu_n)$. An edge-emitting HMM is said to a.s. forget its initial condition at an
exponential rate if there exists some $0 < \alpha < 1$ such that for each initial state i:

\[ \limsup_{t \to \infty} \| \phi_i(X_0^{t-1}) - \phi(X_0^{t-1}) \|_{TV}^{1/t} \leq \alpha , \quad \mathbb{P}_i \ a.s. \]

Definitions for state-emitting HMMs are analogous with $X_0^{t-1}$ replaced by $X_1^t$.
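To make the definition concrete, the sketch below computes the conditional state distribution given an observed word for an edge-emitting HMM, started either from a fixed state or from the stationary distribution, and measures the total variation distance between the two. The matrices `T`, the stationary vector `PI`, and the function names are hypothetical choices for this illustration.

```python
# Hypothetical example: T[x][i][j] = P(S_{t+1} = j, X_t = x | S_t = i).
T = {0: [[0.5, 0.0], [0.2, 0.0]], 1: [[0.2, 0.3], [0.3, 0.5]]}
PI = [0.625, 0.375]  # stationary distribution of T[0] + T[1] for this example

def phi(mu, word):
    """Distribution of the current state given S_0 ~ mu and the observed word."""
    n = len(mu)
    v = list(mu)
    for x in word:
        v = [sum(v[a] * T[x][a][b] for a in range(n)) for b in range(n)]
        total = sum(v)
        if total == 0.0:
            return [0.0] * n  # convention: the all-zeros non-distribution
        v = [p / total for p in v]  # renormalize each step to avoid underflow
    return v

def tv(mu, nu):
    """Total variation distance: half the 1-norm of the difference."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

def forgetting_gap(i, word):
    """TV distance between the state estimates started from state i and from PI."""
    e_i = [1.0 if s == i else 0.0 for s in range(len(PI))]
    return tv(phi(e_i, word), phi(PI, word))

gaps = [forgetting_gap(0, [1] * t) for t in range(1, 6)]
```

In this example symbol 0 always leads to state 0, so the gap is exactly zero after any word ending in 0, and along runs of 1s the gap shrinks rapidly with each additional symbol.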

Forgetting of the initial condition is closely related to convergence of the finite-block entropy rate estimates
h(n) for the stationary output process of a HMM. Indeed, for any $n \in \mathbb{N}$ we have:

$h \geq H(X_{n-1} \mid S_0, X_0^{n-2})$ , for an edge-emitting HMM
$h \geq H(X_n \mid S_0, X_1^{n-1})$ , for a state-emitting HMM

Thus:

$h(n) - h \leq H(X_{n-1} \mid X_0^{n-2}) - H(X_{n-1} \mid S_0, X_0^{n-2})$ , for an edge-emitting HMM (1)
$h(n) - h \leq H(X_n \mid X_1^{n-1}) - H(X_n \mid S_0, X_1^{n-1})$ , for a state-emitting HMM (2)

If the initial state $S_0$ is almost forgotten (with high probability) after the first n - 1 output symbols then the
next symbol does not depend too much (with high probability) on the initial state $S_0$, and the differences
in the averaged conditional entropies on the right hand sides of Equations (1) and (2) are each small.
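The sandwich behind Equation (1) can be observed numerically on a small edge-emitting example. The sketch below computes the upper estimate h(n) and the lower estimate conditioned on the initial state by brute-force enumeration of all length-n words; the matrices `T`, the stationary vector `pi`, and the function names are hypothetical, and the enumeration is only practical for small n.

```python
import math
from itertools import product

# Hypothetical example: T[x][i][j] = P(S_{t+1} = j, X_t = x | S_t = i).
T = {0: [[0.5, 0.0], [0.2, 0.0]], 1: [[0.2, 0.3], [0.3, 0.5]]}
pi = [0.625, 0.375]  # stationary distribution of T[0] + T[1] for this example

def word_prob(mu, word):
    """Probability of `word` when S_0 ~ mu."""
    v = list(mu)
    for x in word:
        v = [sum(v[a] * T[x][a][b] for a in range(len(v))) for b in range(len(v))]
    return sum(v)

def block_entropy(mu, n):
    """Entropy in bits of the first n symbols when S_0 ~ mu."""
    h = 0.0
    for word in product(T.keys(), repeat=n):
        p = word_prob(mu, word)
        if p > 0:
            h -= p * math.log2(p)
    return h

def upper(n):
    """h(n): entropy of the n-th symbol given the previous n - 1 symbols."""
    return block_entropy(pi, n) - block_entropy(pi, n - 1)

def lower(n):
    """Entropy of the n-th symbol given S_0 and the previous n - 1 symbols."""
    total = 0.0
    for i, weight in enumerate(pi):
        e_i = [1.0 if s == i else 0.0 for s in range(len(pi))]
        total += weight * (block_entropy(e_i, n) - block_entropy(e_i, n - 1))
    return total

gaps = [upper(n) - lower(n) for n in range(1, 9)]
```

The entropy rate h lies between `lower(n)` and `upper(n)` for every n, so `gaps[n - 1]` bounds the error of the block estimate h(n), as in Equation (1).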

Remark. An equivalent definition (used, for example, in fffffl for state-emitting HMMs) is to require convergence
(or exponential convergence) of the total variational distance $\|\phi_\mu(X_1^t) - \phi_\nu(X_1^t)\|_{TV}$, a.s., for any
initial state distribution $\mu$ with full support and arbitrary initial state distribution $\nu$. A HMM will a.s. forget
its initial condition (at exponential rate $\alpha$) in this sense, if and only if it a.s. forgets its initial condition
in the sense of Definition 9 (at exponential rate $\alpha$). A stronger condition is to require a.s. convergence, or
exponential convergence, for any two initial distributions $\mu$, $\nu$ without requiring $\mu$ to have full support. We
feel, however, that this is too stringent a requirement, since it implies that the set of allowed future sequences
which can be generated from each initial state i is the same.

2.2.5 Additional Notation 

For a HMM M with output alphabet $\mathcal{X}$ and word $w \in \mathcal{X}^*$ we define:

$\mathbb{P}_i(w) = \mathbb{P}_i(X_0^{|w|-1} = w)$
$\mathbb{P}(w) = \mathbb{P}(X_0^{|w|-1} = w)$

The process language $\mathcal{L}(\mathcal{P})$ for the output process $\mathcal{P} = (X_t)$ of the HMM is the set of words w of positive
probability. $\mathcal{L}_\ell(\mathcal{P})$ is the set of length-$\ell$ words in the process language. The support of the process $\mathcal{P}$,
denoted supp($\mathcal{P}$), is the set of all bi-infinite sequences such that every finite subsequence is in the process
language. Finally, supp($\mathcal{P}^-$) is the set of all infinite past sequences such that every finite subsequence is in
the process language.

$\mathcal{L}(\mathcal{P}) = \{ w \in \mathcal{X}^* : \mathbb{P}(w) > 0 \}$
$\mathcal{L}_\ell(\mathcal{P}) = \{ w \in \mathcal{L}(\mathcal{P}) : |w| = \ell \}$
supp($\mathcal{P}$) $= \{ x_{-\infty}^{\infty} : x_n^m \in \mathcal{L}(\mathcal{P})$ for all $n \leq m \}$
supp($\mathcal{P}^-$) $= \{ x_{-\infty}^{-1} : x_n^m \in \mathcal{L}(\mathcal{P})$ for all $n \leq m \leq -1 \}$



For $w \in \mathcal{L}(\mathcal{P})$, $\mathcal{S}(w)$ is the set of states which can generate w, and $\mathcal{S}(w,j)$ is the set of states which can
transition to j on w. We will only need to apply these definitions for edge-emitting HMMs, in which case they may
be expressed as:

$\mathcal{S}(w) = \{ i \in \mathcal{S} : \mathbb{P}_i(w) > 0 \}$ (edge-emitting)
$\mathcal{S}(w,j) = \{ i \in \mathcal{S} : \mathbb{P}_i(X_0^{|w|-1} = w, S_{|w|} = j) > 0 \}$ (edge-emitting)

We will also use the following (perhaps slightly nonstandard) notation for distributions of random variable
blocks and conditional distributions.

• The distribution of the block of random variables $X_n^m$ according to the stationary measure $\mathbb{P}$ is denoted
by $\mathbb{P}(X_n^m)$, and similarly $\mathbb{P}_i(X_n^m)$ and $\mathbb{P}_\mu(X_n^m)$ denote the distributions of the random variable block
$X_n^m$ according to the measures $\mathbb{P}_i$ and $\mathbb{P}_\mu$. Thus, a statement like $\mathbb{P}_i(X_n^m) = \mathbb{P}_j(X_n^m)$ means that the
distributions of the random variables $X_n^m$ from initial states i and j are equal. A similar notation is
used for distributions of state sequence blocks $S_n^m$ and joint state-symbol blocks $(S_n^m, X_n^m)$.

• As in the definition of $\phi_i(w)$ for an edge-emitting HMM, $\mathbb{P}_i(S_t \mid X_0^{t-1} = x_0^{t-1})$ is the conditional
distribution of the random variable $S_t$ given that the initial state is $S_0 = i$ and $X_0^{t-1} = x_0^{t-1}$. $\mathbb{P}_i(S_t \mid X_0^{t-1})$
denotes the distribution-valued random variable for the conditional distribution over the state $S_t$ given
initial state i and (random) output sequence $X_0^{t-1}$.

• $x_n^m$ and $s_n^m$ will normally be used to represent realizations of the random variable blocks $X_n^m$ and $S_n^m$,
but they should be thought of more generally just as length-(m - n + 1) symbol or state sequences
without a specific starting point in time. So, for example, $\mathbb{P}_i(x_n^m) = \mathbb{P}_i(X_0^{m-n} = x_n^m)$, applying the
definition of $\mathbb{P}_i(w)$ to the word $w = x_n^m$. (This could also be interpreted as $\mathbb{P}_i(x_n^m) = \mathbb{P}_i(X_n^m = x_n^m)$,
but this is NOT what we mean, unless n = 0.)

3 Auxiliary Spaces 

The random variables $S_t$ and $X_t$ for a HMM are assumed to live on an underlying probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
We define now two important auxiliary probability spaces that will be useful in our proofs later on. Throughout
Section 3 we assume that $M = (\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ is a state-collapsible, edge-emitting HMM, and denote the
special state-collapsing symbol for state k by $y_k$: $k \in \delta_i(y_k)$, for all i with $\delta_i(y_k) \neq \emptyset$.

3.1 The Pair Chain Coupling Space 

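For fixed states $k, \hat{k}$ and a symbol sequence $x_0^t \in \mathcal{X}^{t+1}$ ($t \in \mathbb{N}$) such that $\mathbb{P}_k(x_0^t) > 0$ and $\mathbb{P}_{\hat{k}}(x_0^t) > 0$ the
pair chain coupling space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{\mathbb{P}})$ is defined as follows:

• $\tilde{\Omega} = \{ (r_0^{t+1}, \hat{r}_0^{t+1}) : r_\tau, \hat{r}_\tau \in \mathcal{S}$ for $0 \leq \tau \leq t+1 \}$ is the set of length-(t + 2) state sequence pairs.

• $\tilde{\mathcal{F}}$ is the discrete $\sigma$-algebra on $\tilde{\Omega}$ (i.e. all subsets of $\tilde{\Omega}$ are measurable).

• The measure $\tilde{\mathbb{P}}$ on state sequence pairs $(r_0^{t+1}, \hat{r}_0^{t+1}) \in \tilde{\Omega}$ is the pull back measure of the
time-inhomogeneous Markov chain $(R_\tau, \hat{R}_\tau)_{\tau=0}^{t+1}$ with initial distribution

\[ \tilde{\mathbb{P}}(R_0 = k, \hat{R}_0 = \hat{k}) = 1 \]

and transition probabilities

\[ \tilde{\mathbb{P}}(R_{\tau+1} = j, \hat{R}_{\tau+1} = \hat{j} \mid R_\tau = i, \hat{R}_\tau = \hat{i}) =
\begin{cases}
\mathbb{P}(S_{\tau+1} = j \mid S_\tau = i, X_\tau^t = x_\tau^t) \cdot \mathbb{P}(S_{\tau+1} = \hat{j} \mid S_\tau = \hat{i}, X_\tau^t = x_\tau^t), & if \ i \neq \hat{i} \\
\mathbb{P}(S_{\tau+1} = j \mid S_\tau = i, X_\tau^t = x_\tau^t), & if \ i = \hat{i} \ and \ j = \hat{j}
\end{cases} \]

for $0 \leq \tau \leq t$.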
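By marginalizing it follows that the state sequences $(R_\tau)$ and $(\hat{R}_\tau)$ are each individually (time-inhomogeneous)
Markov chains with transition probabilities

\[ \tilde{\mathbb{P}}(R_{\tau+1} = j \mid R_\tau = i) = \mathbb{P}(S_{\tau+1} = j \mid S_\tau = i, X_\tau^t = x_\tau^t) , \ and \]

\[ \tilde{\mathbb{P}}(\hat{R}_{\tau+1} = \hat{j} \mid \hat{R}_\tau = \hat{i}) = \mathbb{P}(S_{\tau+1} = \hat{j} \mid S_\tau = \hat{i}, X_\tau^t = x_\tau^t) . \]

Hence:

\[ \tilde{\mathbb{P}}(R_0^{t+1} = r_0^{t+1}) = \mathbb{P}(S_0^{t+1} = r_0^{t+1} \mid S_0 = k, X_0^t = x_0^t) , \ and \]

\[ \tilde{\mathbb{P}}(\hat{R}_0^{t+1} = \hat{r}_0^{t+1}) = \mathbb{P}(S_0^{t+1} = \hat{r}_0^{t+1} \mid S_0 = \hat{k}, X_0^t = x_0^t) . \]

So we have the following coupling bound:

\[ \| \phi_k(x_0^t) - \phi_{\hat{k}}(x_0^t) \|_{TV} = \| \mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}_{\hat{k}}(S_{t+1} \mid X_0^t = x_0^t) \|_{TV} \leq \tilde{\mathbb{P}}(R_{t+1} \neq \hat{R}_{t+1}) \quad (3) \]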
3.2 The Reverse-Time Generation Space $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$

Fix a state k, and assume without loss of generality (up to relabeling) that the output alphabet is $\mathcal{X} =
\{1, 2, \dots, |\mathcal{X}|\}$ with $y_k = 1$. We construct the reverse-time generation space $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$ and random variables
$(\hat{X}_t)_{t \in -\mathbb{N}}$ for state k as follows:

• $(U_n)_{n \in \mathbb{N}}$ and $(V_n)_{n \in \mathbb{N}}$ are i.i.d. sequences of uniform([0, 1]) random variables independent of one
another.

• $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$ is the canonical probability space (path space) on which the sequences $(U_n)$ and $(V_n)$ are
defined.

• On this space $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$ we define random partitions $P_t$, $t \in -\mathbb{N}$, of the interval [0,1] and random
variables $\hat{X}_t$, $t \in -\mathbb{N}$, inductively as follows:

1. $P_{-1} = \{ I_{-1}^x : x \in \mathcal{X} \}$ where each $I_{-1}^x$ is an interval of length $\mathbb{P}(X_{-1} = x)$, and the intervals are
consecutively placed on [0, 1] and closed at the right endpoint. For example, if $\mathbb{P}(X_{-1} = 1) = 1/2$,
$\mathbb{P}(X_{-1} = 2) = 1/3$, and $\mathbb{P}(X_{-1} = 3) = 1/6$ then $I_{-1}^1 = [0, 1/2]$, $I_{-1}^2 = (1/2, 5/6]$, and $I_{-1}^3 =
(5/6, 1]$. $\hat{X}_{-1}$ is defined by $\hat{X}_{-1} = x \iff U_1 \in I_{-1}^x$.

2. Conditioned on $\hat{X}_{t+1}^{-1} = x_{t+1}^{-1}$ (for $t \leq -2$), $P_t = \{ I_t^x : x \in \mathcal{X} \}$ where the $I_t^x$'s are intervals of
length $\mathbb{P}(X_t = x \mid X_{t+1}^{-1} = x_{t+1}^{-1})$ placed consecutively on [0,1] and closed at the right endpoint,
and:

- If $t + 1 = T_n^k$ for some n, then $\hat{X}_t$ is defined by $\hat{X}_t = x \iff V_n \in I_t^x$.

- Otherwise, $\hat{X}_t$ is defined by $\hat{X}_t = x \iff U_{-t} \in I_t^x$.

Here the random stopping times $T_n^k$, $n \in \mathbb{N}$, are defined by:

- $T_1^k = \max\{ t \leq -1 : \mathbb{P}(X_t^{-1} = \hat{X}_t^{-1} \mid S_t = k) \geq \mathbb{P}(X_t^{-1} = \hat{X}_t^{-1} \mid S_t = j)$, for all $j \}$

- $T_2^k = \max\{ t < T_1^k : \mathbb{P}(X_t^{-1} = \hat{X}_t^{-1} \mid S_t = k) \geq \mathbb{P}(X_t^{-1} = \hat{X}_t^{-1} \mid S_t = j)$, for all $j \}$

- $T_3^k = \max\{ t < T_2^k : \mathbb{P}(X_t^{-1} = \hat{X}_t^{-1} \mid S_t = k) \geq \mathbb{P}(X_t^{-1} = \hat{X}_t^{-1} \mid S_t = j)$, for all $j \}$

If there is no such t for some $n \geq 1$, then $T_n^k = -\infty$ and $T_m^k = -\infty$ for all m > n.
By induction on the length of w it is easily seen that $\hat{\mathbb{P}}(\hat{X}_{-|w|}^{-1} = w) = \mathbb{P}(X_{-|w|}^{-1} = w)$ for any word $w \in \mathcal{X}^*$.
Hence:

\[ (\hat{X}_t)_{t \in -\mathbb{N}} \stackrel{d}{=} (X_t)_{t \in -\mathbb{N}} \quad (4) \]



4 Results for Edge-Emitting HMMs 

With the necessary preliminaries established, we now proceed to the proofs of exponential convergence of
the finite-block entropy rate approximations and exponential rate of memory loss for edge-emitting HMMs
with path-mergeable states. The basic structure of the arguments is as follows:

1. We establish exponential bounds for state-collapsible HMMs.

2. We extend to path-mergeable HMMs by passing to a power machine representation (see Section 4.2.1).

Exponential convergence for state-collapsible HMMs is established by the following steps:

(i) Using large deviation estimates on the $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$ space we show that there exists a set of "good"
length-(t + 1) sequences $G_t$ of combined probability 1 - O(exponentially small), such that for each $x_0^t \in G_t$
there is some state k with $N_k(x_0^t) \geq c_1 \cdot t$. Here $c_1 > 0$ is a constant (depending on the HMM, but not
on t), and

\[ N_k(x_0^t) \equiv | \{ 0 \leq \tau \leq t-1 : x_\tau = y_k, \ \mathbb{P}_k(x_{\tau+1}^t) \geq \mathbb{P}_j(x_{\tau+1}^t) \ for \ all \ j \} | . \quad (5) \]

(ii) Using a coupling argument similar to that given in [14] we show that $\|\phi_k(x_0^t) - \phi_{\hat{k}}(x_0^t)\|_{TV}$ is
exponentially small, for any sequence $x_0^t \in G_t$ and states $k, \hat{k}$ with $\mathbb{P}_k(x_0^t) > 0$ and $\mathbb{P}_{\hat{k}}(x_0^t) > 0$.

(iii) Applying the Borel-Cantelli Lemma we conclude from (i) and (ii) that the initial condition is a.s.
forgotten at an exponential rate.

(iv) Using (ii) we also show that the difference $H(X_{t+1} \mid X_0^t = x_0^t) - H(X_{t+1} \mid X_0^t = x_0^t, S_0)$ is exponentially
small for any $x_0^t \in G_t$.

(v) Using (iv) and the fact that $\mathbb{P}(G_t^c)$ is exponentially small we show that the difference $H(X_{t+1} \mid X_0^t) -
H(X_{t+1} \mid X_0^t, S_0)$ is exponentially small.

(vi) Finally, using (v) and the fact that $H(X_{t+1} \mid X_0^t, S_0) \leq h$, for all t, we conclude that the differences
h(t) - h must be exponentially small.

4.1 Under State-Collapsible Assumption 

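Throughout Section 4.1 we assume $M = (\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ is a state-collapsible, edge-emitting HMM, and
denote the special state-collapsing symbol for state j by $y_j$: $j \in \delta_i(y_j)$, for all i with $\delta_i(y_j) \neq \emptyset$. We define
also the quantities:

$p_j \equiv \mathbb{P}(X_0 = y_j \mid S_1 = j)$ and $p_* \equiv \min_j p_j$

$q_j \equiv \min_{i : \mathbb{P}_i(y_j) > 0} \mathbb{P}_i(S_1 = j \mid X_0 = y_j)$ and $q_* \equiv \min_j q_j$

$r_* \equiv \min_{i,j} \pi_i / \pi_j$

Note that $p_*$, $q_*$, and $r_*$ are all always strictly positive.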
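4.1.1 Large Deviation Estimates for the Space $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$

Let the random variables $Z_n$, $n \in \mathbb{N}$, and $Z_n^{avg}$, $n \in \mathbb{N}$, on $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$ be defined by:

\[ Z_n = \begin{cases} 1, & if \ V_n \leq p_* r_* / |\mathcal{S}| , \\ 0, & else \end{cases} \]

and

\[ Z_n^{avg} = \frac{1}{n} \sum_{m=1}^{n} Z_m . \]

Also, define the random fractions $F_n^k$, $n \in \mathbb{N}$, by:

\[ F_n^k = c/n , \ if \ T_n^k > -\infty , \ where \ c = | \{ t \leq -1 : t = T_m^k \ for \ some \ 1 \leq m \leq n \ and \ \hat{X}_{t-1} = y_k \} | . \]

Lemma 1. $\hat{\mathbb{P}}\left( Z_n^{avg} \leq \frac{p_* r_*}{2|\mathcal{S}|} \right) \leq \alpha_1^n$, where $\alpha_1 = \exp\left( -\frac{p_*^2 r_*^2}{2|\mathcal{S}|^2} \right) < 1$.

Proof. The $Z_n$ are i.i.d. with $0 \leq Z_n \leq 1$ and $\mathbb{E} Z_n = \frac{p_* r_*}{|\mathcal{S}|}$. Thus, applying Hoeffding's inequality to the
i.i.d. sequence $Z_n$ yields:

\[ \hat{\mathbb{P}}\left( Z_n^{avg} \leq \frac{p_* r_*}{2|\mathcal{S}|} \right) = \hat{\mathbb{P}}\left( Z_n^{avg} - \frac{p_* r_*}{|\mathcal{S}|} \leq -\frac{p_* r_*}{2|\mathcal{S}|} \right) \leq \exp\left( -\frac{2 n \left( \frac{p_* r_*}{2|\mathcal{S}|} \right)^2}{(1-0)^2} \right) = \alpha_1^n . \]

□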
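The Hoeffding step can be sanity-checked numerically, since the average of n i.i.d. indicator variables has an exactly computable binomial lower tail. In the sketch below the success probability `p` stands in for the mean of the indicator variables; the value 0.25 and the function names are hypothetical.

```python
import math

def exact_lower_tail(n, p):
    """P(average of n i.i.d. Bernoulli(p) variables <= p/2), from the binomial pmf."""
    cutoff = math.floor(n * p / 2)
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(cutoff + 1)
    )

def hoeffding_bound(n, p):
    """Hoeffding bound exp(-2 n (p/2)^2) = alpha^n with alpha = exp(-p^2 / 2)."""
    return math.exp(-n * p * p / 2)

p = 0.25  # hypothetical value of the indicator mean
checks = [(n, exact_lower_tail(n, p), hoeffding_bound(n, p)) for n in (1, 10, 50, 200)]
```

For each n in `checks` the exact tail probability sits below the exponential bound, and both decay as n grows.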



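Lemma 2. Let $w \in \mathcal{L}_\ell(\mathcal{P})$ be a word such that:

\[ \mathbb{P}(X_{-\ell}^{-1} = w \mid S_{-\ell} = k) \geq \mathbb{P}(X_{-\ell}^{-1} = w \mid S_{-\ell} = j) , \ for \ all \ j \]

Then:

(i) $\mathbb{P}(S_{-\ell} = k \mid X_{-\ell}^{-1} = w) \geq r_* \cdot \mathbb{P}(S_{-\ell} = j \mid X_{-\ell}^{-1} = w)$ , for all j
(ii) $\mathbb{P}(S_{-\ell} = k \mid X_{-\ell}^{-1} = w) \geq r_* / |\mathcal{S}|$

Proof. By Bayes' rule and the assumption on w, for any state j:

\[ \mathbb{P}(S_{-\ell} = k \mid X_{-\ell}^{-1} = w) = \mathbb{P}(X_{-\ell}^{-1} = w \mid S_{-\ell} = k) \cdot \frac{\pi_k}{\mathbb{P}(X_{-\ell}^{-1} = w)} \geq r_* \cdot \mathbb{P}(X_{-\ell}^{-1} = w \mid S_{-\ell} = j) \cdot \frac{\pi_j}{\mathbb{P}(X_{-\ell}^{-1} = w)} = r_* \cdot \mathbb{P}(S_{-\ell} = j \mid X_{-\ell}^{-1} = w) \]

This proves (i), and (ii) follows since $\mathbb{P}(S_{-\ell} = j \mid X_{-\ell}^{-1} = w) \geq 1/|\mathcal{S}|$ for at least one state j. □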
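Lemma 3. If $T_m^k > -\infty$ and $Z_m = 1$, then $\hat{X}_{T_m^k - 1} = y_k$.

Proof. Note that for any $t \leq -1$ the event $\{T_m^k = t\}$ depends only on $\hat{X}_t^{-1}$. Define:

\[ C_m^k = \{ x_t^{-1} \in \mathcal{L}(\mathcal{P}) : t \leq -1 , \ and \ \hat{X}_t^{-1} = x_t^{-1} \implies T_m^k = t \} \]

Then for any $x_t^{-1} \in C_m^k$ we have by Lemma 2:

\[ \mathbb{P}(X_{t-1} = y_k \mid X_t^{-1} = x_t^{-1}) \geq \mathbb{P}(S_t = k \mid X_t^{-1} = x_t^{-1}) \cdot \mathbb{P}(X_{t-1} = y_k \mid S_t = k) \geq \frac{r_*}{|\mathcal{S}|} \cdot p_* \]

But $Z_m = 1$ means $V_m \leq p_* r_* / |\mathcal{S}|$, and since $y_k = 1$ the interval $I_{t-1}^{y_k}$ is the leftmost interval of the partition $P_{t-1}$.
Thus,

\[ \hat{X}_t^{-1} = x_t^{-1} \ and \ Z_m = 1 \implies V_m \in [0, \mathbb{P}(X_{t-1} = y_k \mid X_t^{-1} = x_t^{-1})] = I_{t-1}^{y_k} \implies \hat{X}_{t-1} = y_k \]

Since this holds for any $x_t^{-1} \in C_m^k$ the claim follows. □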
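Lemma 4. If $T_n^k > -\infty$ and $Z_n^{avg} \geq \frac{p_* r_*}{2|\mathcal{S}|}$, then $F_n^k \geq \frac{p_* r_*}{2|\mathcal{S}|}$.

Proof. Assume $T_n^k > -\infty$. Then, by Lemma 3, $\hat{X}_{T_m^k - 1} = y_k$ for each $m \in \{1, \dots, n\}$ with $Z_m = 1$. So:

\[ F_n^k = \frac{1}{n} | \{ 1 \leq m \leq n : \hat{X}_{T_m^k - 1} = y_k \} | \geq \frac{1}{n} | \{ 1 \leq m \leq n : Z_m = 1 \} | = Z_n^{avg} \]

Hence, $T_n^k > -\infty$ and $Z_n^{avg} \geq \frac{p_* r_*}{2|\mathcal{S}|}$ together imply $F_n^k \geq \frac{p_* r_*}{2|\mathcal{S}|}$. □

Lemma 5. $\hat{\mathbb{P}}\left( \left\{ T_n^k > -\infty \ and \ F_n^k < \frac{p_* r_*}{2|\mathcal{S}|} \right\} \right) \leq \alpha_1^n$

Proof. By Lemmas 1 and 4 we have:

\[ \hat{\mathbb{P}}\left( \left\{ T_n^k > -\infty \ and \ F_n^k < \frac{p_* r_*}{2|\mathcal{S}|} \right\} \right) \leq \hat{\mathbb{P}}\left( \left\{ T_n^k > -\infty \ and \ Z_n^{avg} < \frac{p_* r_*}{2|\mathcal{S}|} \right\} \right) \leq \hat{\mathbb{P}}\left( Z_n^{avg} \leq \frac{p_* r_*}{2|\mathcal{S}|} \right) \leq \alpha_1^n \]

□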
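4.1.2 Bound from Below on $\mathbb{P}(G_t)$

The random variables $T_n^k$ and $F_n^k$ on $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$ depend only on the symbol sequence $(\hat{X}_t)_{t \in -\mathbb{N}}$: $T_n^k = T_n^k(\hat{X}_{-\infty}^{-1})$
and $F_n^k = F_n^k(\hat{X}_{-\infty}^{-1})$. As such, we may define corresponding random variables $T_n^k(X_{-\infty}^{-1})$ and $F_n^k(X_{-\infty}^{-1})$ for the standard
HMM probability space $(\Omega, \mathcal{F}, \mathbb{P})$, depending on $(X_t)_{t \in -\mathbb{N}}$. Since $(X_t)_{t \in -\mathbb{N}} \stackrel{d}{=} (\hat{X}_t)_{t \in -\mathbb{N}}$ we have, necessarily,
$(T_n^k(X_{-\infty}^{-1}))_{n \in \mathbb{N}} \stackrel{d}{=} (T_n^k(\hat{X}_{-\infty}^{-1}))_{n \in \mathbb{N}}$ and $(F_n^k(X_{-\infty}^{-1}))_{n \in \mathbb{N}} \stackrel{d}{=} (F_n^k(\hat{X}_{-\infty}^{-1}))_{n \in \mathbb{N}}$. This will be quite useful in the following development.

Before proceeding, however, we first need to introduce some more notation and terminology. We say
$x_{-\infty}^{-1}$ is an extension of the word w if $x_{-|w|}^{-1} = w$. We define the constant $c_1 = \frac{p_* r_*}{2|\mathcal{S}|^2}$, and we define
$n_t = \lceil t/|\mathcal{S}| \rceil$ and $N_k(x_0^t)$ as in Equation (5). Finally, we define the following sets:

$A_n^k = \{ x_{-\infty}^{-1} \in$ supp($\mathcal{P}^-$) : $T_n^k(x_{-\infty}^{-1}) > -\infty$ and $F_n^k(x_{-\infty}^{-1}) < \frac{p_* r_*}{2|\mathcal{S}|} \}$ , $k \in \mathcal{S}$ and $n \in \mathbb{N}$
$B_n = \{ x_{-\infty}^{-1} \in$ supp($\mathcal{P}^-$) : $F_n^k(x_{-\infty}^{-1}) \geq \frac{p_* r_*}{2|\mathcal{S}|}$ for all k with $T_n^k(x_{-\infty}^{-1}) > -\infty \}$ , $n \in \mathbb{N}$
$C_t = \{ x_{-\infty}^{-1} \in$ supp($\mathcal{P}^-$) : there exists k with $T_{n_t}^k(x_{-\infty}^{-1}) \geq -t$ and $F_{n_t}^k(x_{-\infty}^{-1}) \geq \frac{p_* r_*}{2|\mathcal{S}|} \}$ , $t \in \mathbb{N}$
$C_t' = \{ x_{-t-1}^{-1} \in \mathcal{L}_{t+1}(\mathcal{P})$ : all extensions of $x_{-t-1}^{-1}$ are in $C_t \}$ , $t \in \mathbb{N}$
$G_t = \{ x_0^t \in \mathcal{L}_{t+1}(\mathcal{P})$ : there exists k with $N_k(x_0^t) \geq c_1 t \}$ , $t \in \mathbb{N}$

Note that $C_t'$ and $G_t$ may both be considered simply as sets of length-(t + 1) words w. For $C_t'$ there is the
added interpretation of the words as length-(t + 1) past sequences, and for $G_t$ there is the added interpretation
of the words as length-(t + 1) future sequences. However, $C_t'$ and $G_t$ are both well defined simply as sets
of words. As such, we may reasonably ask questions about when one set is contained in another, and the
probability of these sets as the combined (stationary) probability of all words in the sets.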

Lemma 6. For any $n \in \mathbb{N}$:

(i) $\mathbb{P}(A_n^k) \leq \alpha_1^n$, for each k
(ii) $\mathbb{P}(B_n) \geq 1 - |\mathcal{S}| \alpha_1^n$

Proof. (i) follows from Lemma 5 and Equation (4). (ii) follows from (i) and the fact that the complement of
the set $B_n$ is $B_n^c = \bigcup_k A_n^k$. (Note: Complement here means complement with respect to the set of sequences
supp($\mathcal{P}^-$), i.e. $B_n \cup B_n^c =$ supp($\mathcal{P}^-$). The combined probability of all other sequences is 0, so they can be
ignored.) □

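Lemma 7. $C_t \supseteq B_{n_t}$

Proof. For any $x_{-\infty}^{-1} \in$ supp($\mathcal{P}^-$) there is some k such that $T_{n_t}^k(x_{-\infty}^{-1}) \geq -t$. And, for any $x_{-\infty}^{-1} \in B_{n_t}$ with
$T_{n_t}^k(x_{-\infty}^{-1}) \geq -t > -\infty$, we have $F_{n_t}^k(x_{-\infty}^{-1}) \geq \frac{p_* r_*}{2|\mathcal{S}|}$, which implies $x_{-\infty}^{-1} \in C_t$. □

Lemma 8. $G_t \supseteq C_t'$

Proof. Let $w \in C_t'$. Then there exists k such that:

\[ T_{n_t}^k(x_{-\infty}^{-1}) \geq -t \ and \ F_{n_t}^k(x_{-\infty}^{-1}) \geq \frac{p_* r_*}{2|\mathcal{S}|} \]

for any extension $x_{-\infty}^{-1}$ of w in supp($\mathcal{P}^-$). It follows that:

\[ N_k(w) \geq n_t \cdot \frac{p_* r_*}{2|\mathcal{S}|} \geq \frac{t}{|\mathcal{S}|} \cdot \frac{p_* r_*}{2|\mathcal{S}|} = c_1 t \]

which implies $w \in G_t$. □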
Lemma 9. For any $t \in \mathbb{N}$, $\mathbb{P}(G_t) \geq 1 - |\mathcal{S}| \alpha_2^t$, where $\alpha_2 = \alpha_1^{1/|\mathcal{S}|} < 1$.

Proof. By Lemmas 6, 7, and 8 we have:

\[ \mathbb{P}(G_t) \geq \mathbb{P}(C_t') = \mathbb{P}(C_t) \geq \mathbb{P}(B_{n_t}) \geq 1 - |\mathcal{S}| \alpha_1^{n_t} \geq 1 - |\mathcal{S}| \alpha_2^t \]

□

4.1.3 Pair Chain Coupling Bound 

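Assume now, without loss of generality, that the state set is $\mathcal{S} = \{1, 2, \ldots, |\mathcal{S}|\}$, and for $x_0^t \in G_t$ let the index
$I(x_0^t)$ and time set $\Gamma(x_0^t)$ be defined by:

$$I = \min\{k : N_k(x_0^t) \geq c_1 t\}$$

$$\Gamma = \{0 \leq \tau \leq t-1 : x_\tau = y_I ,\ \mathbb{P}_I(x_{\tau+1}^t) \geq \mathbb{P}_j(x_{\tau+1}^t) \text{ for all } j\}$$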

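Lemma 10. For any $x_0^t \in G_t$, $\tau \in \Gamma(x_0^t)$, and $i \in \mathcal{S}$ with $\mathbb{P}_i(x_\tau^t) > 0$:

$$\mathbb{P}(S_{\tau+1} = I \mid S_\tau = i, X_\tau^t = x_\tau^t) \geq q_*$$

Proof. Fix $x_0^t \in G_t$ and $\tau \in \Gamma(x_0^t)$. For any state $i$ with $\mathbb{P}_i(x_\tau^t) > 0$ we have that $\mathbb{P}_i(x_\tau) = \mathbb{P}_i(y_I) > 0$, and:

$$\mathbb{P}(X_{\tau+1}^t = x_{\tau+1}^t \mid S_\tau = i, X_\tau = y_I) \leq \mathbb{P}(X_{\tau+1}^t = x_{\tau+1}^t \mid S_{\tau+1} = I)$$

Thus:

$$\begin{aligned}
\mathbb{P}(S_{\tau+1} = I \mid S_\tau = i, X_\tau^t = x_\tau^t)
&= \frac{\mathbb{P}(S_{\tau+1} = I, S_\tau = i, X_\tau = y_I, X_{\tau+1}^t = x_{\tau+1}^t)}{\mathbb{P}(S_\tau = i, X_\tau = y_I, X_{\tau+1}^t = x_{\tau+1}^t)} \\
&= \frac{\mathbb{P}(S_\tau = i, X_\tau = y_I) \cdot \mathbb{P}(S_{\tau+1} = I \mid S_\tau = i, X_\tau = y_I) \cdot \mathbb{P}(X_{\tau+1}^t = x_{\tau+1}^t \mid S_{\tau+1} = I)}{\mathbb{P}(S_\tau = i, X_\tau = y_I) \cdot \mathbb{P}(X_{\tau+1}^t = x_{\tau+1}^t \mid S_\tau = i, X_\tau = y_I)} \\
&\geq \mathbb{P}(S_{\tau+1} = I \mid S_\tau = i, X_\tau = y_I)
\end{aligned}$$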

□ 

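Lemma 11. For any $x_0^t \in G_t$ and states $k, \hat{k}$ with $\mathbb{P}_k(x_0^t) > 0$ and $\mathbb{P}_{\hat{k}}(x_0^t) > 0$:

$$\|\mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}_{\hat{k}}(S_{t+1} \mid X_0^t = x_0^t)\|_{TV} \leq \alpha_3^t ,$$

where $\alpha_3 = (1 - q_*^2)^{c_1} < 1$.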






Proof. Applying the pair chain coupling bound (3) we have:

< P(R t +i ? Rt+i) 



\[v{r t+1 ^r t+1 \r t ^r 

T=Q 

< I] P(i?r + l^i?r+l|i?r^i?r) 

+ 1 7^ -Rt + iI-Rt = i,Rr = l\ 
1 -P (Rt+1 = Rr + 1 =l\Rr= i,Rr =1 



rer(4) 
< max 



< 



n 



(a) „ 

< n c 1 -^) 

rer(4) 

(6) 

< (1-9* 2 ) C1 * 

= 4 - 



where (a) follows from Lemma 10 and (b) from the fact that $|\Gamma(x_0^t)| \geq c_1 t$. (Note: We have assumed here
that $\mathbb{P}(R_\tau \neq \hat{R}_\tau) > 0$, for all $0 \leq \tau \leq t$. If this is not the case the conclusion follows trivially.)



□ 



4.1.4 Forgetting of the Initial Condition 

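Lemma 12. Let $x_0^t$ be a length-$(t+1)$ sequence with $\mathbb{P}_k(x_0^t) > 0$. Then:

$$\|\mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}(S_{t+1} \mid X_0^t = x_0^t)\|_{TV} \leq \max_{\hat{k} \in \mathcal{S}(x_0^t)} \|\mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}_{\hat{k}}(S_{t+1} \mid X_0^t = x_0^t)\|_{TV}$$

Proof. It is equivalent to prove the statement for 1-norms, in which case we have:

$$\begin{aligned}
\|\mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}(S_{t+1} \mid X_0^t = x_0^t)\|_1
&= \Big\| \mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \sum_{\hat{k} \in \mathcal{S}(x_0^t)} \mathbb{P}(S_0 = \hat{k} \mid X_0^t = x_0^t) \cdot \mathbb{P}_{\hat{k}}(S_{t+1} \mid X_0^t = x_0^t) \Big\|_1 \\
&\leq \sum_{\hat{k} \in \mathcal{S}(x_0^t)} \mathbb{P}(S_0 = \hat{k} \mid X_0^t = x_0^t) \cdot \|\mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}_{\hat{k}}(S_{t+1} \mid X_0^t = x_0^t)\|_1 \\
&\leq \max_{\hat{k} \in \mathcal{S}(x_0^t)} \|\mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}_{\hat{k}}(S_{t+1} \mid X_0^t = x_0^t)\|_1
\end{aligned}$$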



□ 



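Theorem 1. Any state-collapsible, edge-emitting HMM forgets its initial condition a.s. at exponential rate
$\alpha_3$ or faster. For each state $i$:

$$\limsup_{t \to \infty} \|\phi(X_0^{t-1}) - \phi_i(X_0^{t-1})\|_{TV}^{1/t} \leq \alpha_3 , \quad \mathbb{P}_i \text{ a.s.}$$

Proof. Since $\mathbb{P}(\cdot) = \sum_i \pi_i \mathbb{P}_i(\cdot)$, we have $\mathbb{P}_i(E) \leq \mathbb{P}(E)/\pi_i \leq \mathbb{P}(E)/(\min_j \pi_j)$ for any state $i$ and measurable
event $E$. Hence, by Lemma 9, we have for any state $i$:

$$\mathbb{P}_i(G_t^c) \leq c_2 \alpha_2^t , \quad \text{where } c_2 = \frac{|\mathcal{S}|}{\min_j \pi_j}$$

So, by the Borel-Cantelli Lemma, $\mathbb{P}_i(\{x_0^\infty : x_0^t \notin G_t \text{ i.o.}\}) = 0$. And, for $x_0^t \in G_t$ with $\mathbb{P}_i(x_0^t) > 0$ we have
by Lemmas 11 and 12:

$$\begin{aligned}
\|\phi_i(x_0^t) - \phi(x_0^t)\|_{TV}
&= \|\mathbb{P}_i(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}(S_{t+1} \mid X_0^t = x_0^t)\|_{TV} \\
&\leq \max_{\hat{k} \in \mathcal{S}(x_0^t)} \|\mathbb{P}_i(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}_{\hat{k}}(S_{t+1} \mid X_0^t = x_0^t)\|_{TV} \\
&\leq \max_{k, \hat{k} \in \mathcal{S}(x_0^t)} \|\mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}_{\hat{k}}(S_{t+1} \mid X_0^t = x_0^t)\|_{TV} \\
&\leq \alpha_3^t
\end{aligned}$$

The claim follows. □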
4.1.5 Convergence of Entropy Rate Approximations 

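Lemma 13. Let $\mu = (\mu_1, \ldots, \mu_n)$ and $\nu = (\nu_1, \ldots, \nu_n)$ be two probability measures on a finite set $\{1, \ldots, n\}$.
If $\|\mu - \nu\|_{TV} \leq \epsilon$ for some $0 \leq \epsilon \leq 1/e$ then:

$$|H(\mu) - H(\nu)| \leq n \epsilon \log_2(1/\epsilon)$$

Proof. Recall that, for a logarithm of any base, we use the convention $0 \cdot \log(0) = 0$, obtained by continuous
extension of the function $x \log(x)$ to the point $x = 0$. Let us also define $0 \cdot \log(1/0) = 0$, by continuous
extension of the function $x \log(1/x) = -x \log(x)$ to the point $x = 0$.

Further, applying these conventions, let us define for any $0 \leq \epsilon \leq 1$ the function $f_\epsilon(x) : [0, 1 - \epsilon] \to \mathbb{R}$ by:

$$f_\epsilon(x) = (x + \epsilon) \ln(x + \epsilon) - x \ln(x)$$

It is easily checked that for $\epsilon \in [0, 1/e]$:

$$\max_{x \in [0, 1-\epsilon]} |f_\epsilon(x)| = |f_\epsilon(0)| = \epsilon \ln(1/\epsilon)$$

Now, if $\mu$ and $\nu$ are two distributions such that $\|\mu - \nu\|_{TV} \leq \epsilon$, for some $0 \leq \epsilon \leq 1/e$, then $|\mu_k - \nu_k| \leq \epsilon$ for
all $k$. Using this, the bound on $|f_\epsilon|$, and the fact that $g(x) = x \ln(1/x)$ is increasing on $[0, 1/e]$ we have:

$$\begin{aligned}
|H(\mu) - H(\nu)| &= \Big| \sum_k \nu_k \log_2(\nu_k) - \mu_k \log_2(\mu_k) \Big| \\
&\leq \log_2(e) \cdot \sum_k |\nu_k \ln(\nu_k) - \mu_k \ln(\mu_k)| \\
&\leq n \log_2(e) \cdot \max_{\epsilon' \in [0, \epsilon]} \max_{x \in [0, 1-\epsilon']} |(x + \epsilon') \ln(x + \epsilon') - x \ln(x)| \\
&= n \log_2(e) \cdot \max_{\epsilon' \in [0, \epsilon]} \max_{x \in [0, 1-\epsilon']} |f_{\epsilon'}(x)| \\
&\leq n \log_2(e) \cdot \max_{\epsilon' \in [0, \epsilon]} \epsilon' \ln(1/\epsilon') \\
&= n \log_2(e) \cdot \epsilon \ln(1/\epsilon) \\
&= n \epsilon \log_2(1/\epsilon)
\end{aligned}$$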



□ 
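The bound of Lemma 13 is easy to probe numerically. The following is a minimal sketch, not part of the original argument: it assumes the total variation distance is taken as half the 1-norm, and the helper names (`entropy_bits`, `tv_distance`, `lemma13_bound`, `worst_ratio`) are hypothetical. It draws random pairs of nearby distributions and reports the largest observed ratio of the entropy gap to the bound.

```python
import math
import random


def entropy_bits(p):
    """Shannon entropy in bits, using the convention 0 * log(0) = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0.0)


def tv_distance(p, q):
    """Total variation distance, taken here (an assumption) as half the 1-norm."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def lemma13_bound(n, eps):
    """Right-hand side n * eps * log2(1/eps), with value 0 at eps = 0."""
    return n * eps * math.log2(1.0 / eps) if eps > 0.0 else 0.0


def random_distribution(n, rng):
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def worst_ratio(trials=2000, n=4, seed=0):
    """Largest observed |H(mu) - H(nu)| / bound over random nearby pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        mu = random_distribution(n, rng)
        other = random_distribution(n, rng)
        lam = 0.01 + 0.29 * rng.random()  # TV(mu, nu) <= lam <= 0.3 < 1/e
        nu = [(1.0 - lam) * a + lam * b for a, b in zip(mu, other)]
        eps = tv_distance(mu, nu)
        if eps == 0.0:
            continue
        gap = abs(entropy_bits(mu) - entropy_bits(nu))
        worst = max(worst, gap / lemma13_bound(n, eps))
    return worst
```

A ratio above 1 would contradict the lemma, so `worst_ratio()` staying at or below 1 is a quick sanity check on the reconstruction of the statement.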



Lemma 14. For $0 < \alpha < 1$, let $t_0 = t_0(\alpha) = \lceil \log_\alpha(1/e) \rceil$. Then for any $t \geq t_0$ and any two probability
measures $\mu = (\mu_1, \ldots, \mu_n)$ and $\nu = (\nu_1, \ldots, \nu_n)$ on the set $\{1, \ldots, n\}$ with $\|\mu - \nu\|_{TV} \leq \alpha^t$:

$$|H(\mu) - H(\nu)| \leq -n \log_2(\alpha) \cdot t \alpha^t$$

Proof. Let $\mu$ and $\nu$ be two distributions with $\|\mu - \nu\|_{TV} = \epsilon \leq \alpha^t$ for some $t \geq t_0$. Since $\alpha^t \leq 1/e$ for $t \geq t_0$,
and the function $x \log_2(1/x)$ is increasing on $[0, 1/e]$, we have by Lemma 13:

$$\begin{aligned}
|H(\mu) - H(\nu)| &\leq n \epsilon \log_2(1/\epsilon) \\
&\leq n \alpha^t \log_2(1/\alpha^t) \\
&= -n \log_2(\alpha) \cdot t \alpha^t
\end{aligned}$$

□

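Lemma 15. Let $\mu$ and $\nu$ be two distributions on $\mathcal{S}$. Then $\|\mathbb{P}_\mu(X_0) - \mathbb{P}_\nu(X_0)\|_{TV} \leq \|\mu - \nu\|_{TV}$.

Proof. It is equivalent to prove the statement for 1-norms. In this case we have:

$$\begin{aligned}
\|\mathbb{P}_\mu(X_0) - \mathbb{P}_\nu(X_0)\|_1 &= \Big\| \sum_k \mu_k \cdot \mathbb{P}_k(X_0) - \sum_k \nu_k \cdot \mathbb{P}_k(X_0) \Big\|_1 \\
&\leq \sum_k |\mu_k - \nu_k| \cdot \|\mathbb{P}_k(X_0)\|_1 \\
&= \|\mu - \nu\|_1
\end{aligned}$$

□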



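Lemma 16. For $t \geq t_0 = t_0(\alpha_3)$ (as defined in Lemma 14) and $x_0^t \in G_t$:

$$H(X_{t+1} \mid X_0^t = x_0^t) - H(X_{t+1} \mid S_0, X_0^t = x_0^t) \leq -|\mathcal{X}| \log_2(\alpha_3) \cdot t \alpha_3^t$$

Proof. Let $x_0^t \in G_t$ with $t \geq t_0$. By Lemma 11 we have:

$$\|\mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}_{\hat{k}}(S_{t+1} \mid X_0^t = x_0^t)\|_{TV} \leq \alpha_3^t , \quad \text{for all } k, \hat{k} \in \mathcal{S}(x_0^t).$$

Thus, by Lemma 12:

$$\|\mathbb{P}_k(S_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}(S_{t+1} \mid X_0^t = x_0^t)\|_{TV} \leq \alpha_3^t , \quad \text{for all } k \in \mathcal{S}(x_0^t).$$

By Lemma 15 this implies:

$$\|\mathbb{P}_k(X_{t+1} \mid X_0^t = x_0^t) - \mathbb{P}(X_{t+1} \mid X_0^t = x_0^t)\|_{TV} \leq \alpha_3^t , \quad \text{for all } k \in \mathcal{S}(x_0^t).$$

Hence, applying Lemma 14 we have the bound:

$$H(X_{t+1} \mid X_0^t = x_0^t) - H(X_{t+1} \mid X_0^t = x_0^t, S_0 = k) \leq -|\mathcal{X}| \log_2(\alpha_3) \cdot t \alpha_3^t , \quad \text{for all } k \in \mathcal{S}(x_0^t).$$

The claim follows from this bound and the fact that:

$$H(X_{t+1} \mid X_0^t = x_0^t) - H(X_{t+1} \mid S_0, X_0^t = x_0^t) = \sum_k \mathbb{P}(S_0 = k \mid X_0^t = x_0^t) \cdot \big( H(X_{t+1} \mid X_0^t = x_0^t) - H(X_{t+1} \mid X_0^t = x_0^t, S_0 = k) \big)$$

□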



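Lemma 17. Let $\alpha_4 = \max\{\alpha_2, \alpha_3\}$. Then $\limsup_{t \to \infty} \{H(X_{t+1} \mid X_0^t) - H(X_{t+1} \mid X_0^t, S_0)\}^{1/t} \leq \alpha_4$.

Proof. Applying Lemmas 9 and 16 we have that for all $t \geq t_0 = t_0(\alpha_3)$:

$$\begin{aligned}
H(X_{t+1} \mid X_0^t) - H(X_{t+1} \mid X_0^t, S_0) &= \sum_{x_0^t} \mathbb{P}(x_0^t) \cdot \big[ H(X_{t+1} \mid X_0^t = x_0^t) - H(X_{t+1} \mid X_0^t = x_0^t, S_0) \big] \\
&\leq \mathbb{P}(G_t) \cdot \{-|\mathcal{X}| \log_2(\alpha_3) \cdot t \alpha_3^t\} + \mathbb{P}(G_t^c) \cdot \log_2|\mathcal{X}| \\
&\leq 1 \cdot \{-|\mathcal{X}| \log_2(\alpha_3) \cdot t \alpha_3^t\} + |\mathcal{S}| \alpha_2^t \cdot \log_2|\mathcal{X}|
\end{aligned}$$

The claim follows directly from this estimate. □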
Theorem 2. For any state-collapsible, edge-emitting HMM:

$$\limsup_{t \to \infty} \{h(t) - h\}^{1/t} \leq \alpha_4$$

Proof. For any $t \in \mathbb{N}$:

$$h = \lim_{\tau \to \infty} H(X_0 \mid X_{-\tau}^{-1}) \geq \lim_{\tau \to \infty} H(X_0 \mid X_{-\tau}^{-1}, S_{-(t+1)}) = H(X_0 \mid X_{-(t+1)}^{-1}, S_{-(t+1)}) = H(X_{t+1} \mid X_0^t, S_0)$$

Thus:

$$h(t+2) - h = H(X_{t+1} \mid X_0^t) - h \leq H(X_{t+1} \mid X_0^t) - H(X_{t+1} \mid X_0^t, S_0)$$

The claim follows directly from this inequality and Lemma 17. □

4.2 Under Path-Mergeable Assumption 

Building on the results of the previous section for state-collapsible HMMs, we now proceed to the proofs 
of exponential convergence of the entropy rate approximations and exponential rate of memory loss for 
path-mergeable HMMs. The general approach is as follows: 

(i) We show that for any path-mergeable HMM $M$ there is some $n \in \mathbb{N}$ such that the power machine
$M^n$ (defined below) is state-collapsible.

(ii) We combine (i) with the exponential convergence bounds for state-collapsible HMMs (Theorems 1 and
2) to obtain the desired bounds for path-mergeable HMMs.

4.2.1 Power Machines 

Let $M = (\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ be an edge-emitting hidden Markov model with probability measure $\mathbb{P}$ on its output
and internal state sequences. The n-block model or power machine $M^n$ is the triple $(\mathcal{S}, \mathcal{W}, \{Q^{(w)}\})$ where:

• $\mathcal{W} = \mathcal{L}_n(\mathcal{P})$ is the set of length-$n$ words of positive probability.

• $Q^{(w)}_{ij} = \mathbb{P}_i(X_0^{n-1} = w, S_n = j)$ is the $n$-step transition probability from $i$ to $j$ on $w$.

It can be shown that if $M$ is irreducible and $n$ is relatively prime to the period of $M$'s graph, then
$M^n$ is also irreducible. Further, in this case $M$ and $M^n$ have the same stationary distribution $\pi$ and the
output process of $M^n$ is the same as (i.e. equal in distribution to) the output process of $M$ when the latter
is considered over length-$n$ blocks rather than individual symbols. The following important lemma allows
us to reduce questions for path-mergeable HMMs to analogous questions for state-collapsible HMMs by
considering such block presentations.
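The definition of the power machine translates directly into matrix products: $Q^{(w)} = T^{(w_0)} T^{(w_1)} \cdots T^{(w_{n-1})}$. The sketch below is illustrative only. It assumes the HMM is stored as a dictionary from symbols to square matrices (lists of rows), and it keeps a word whenever its matrix is nonzero, which coincides with positive stationary probability when every state has positive stationary mass. The names `power_machine`, `T_EXAMPLE` and `Q_EXAMPLE` are hypothetical.

```python
from itertools import product


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def power_machine(T, n):
    """Return {w: Q^(w)} over length-n words w whose matrix Q^(w) is nonzero."""
    size = len(next(iter(T.values())))
    identity = [[float(i == j) for j in range(size)] for i in range(size)]
    Q = {}
    for word in product(sorted(T), repeat=n):
        q = identity
        for x in word:
            q = mat_mul(q, T[x])  # Q^(w) = T^(w_0) T^(w_1) ... T^(w_{n-1})
        if any(v > 0.0 for row in q for v in row):
            Q[word] = q
    return Q


# Hypothetical two-state example: state 0 emits 'a' (stay) or 'b' (move to 1),
# state 1 always emits 'b' and moves back to 0.
T_EXAMPLE = {
    "a": [[0.5, 0.0], [0.0, 0.0]],
    "b": [[0.0, 0.5], [1.0, 0.0]],
}
Q_EXAMPLE = power_machine(T_EXAMPLE, 2)
```

Summing the matrices in `Q_EXAMPLE` over all words gives a row-stochastic matrix, as it must, since the block matrices describe the same chain observed every $n$ steps.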

Lemma 18. If $M$ is an edge-emitting HMM with path-mergeable states, then there exists some $n \in \mathbb{N}$ such
that the power machine $M^n$ is state-collapsible.
Proof. The proof is by explicit construction. Let us denote $M = (\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ and assume without loss of
generality (up to relabeling) that $\mathcal{S} = \{1, 2, \ldots, |\mathcal{S}|\}$. Also, for each pair of states $i, j$ denote by $w_{ij}$ and $k_{ij}$
the special word $w$ and state $k$ in the definition of path-mergeability, so that $k_{ij} \in \delta_i(w_{ij})$ and $k_{ij} \in \delta_j(w_{ij})$.
Additionally, for each state $i$, let $w_{ii} = \lambda$ (the null word) and $k_{ii} = i$. The base collapsing word $v_*$ and base
collapsing state $i_*$ are defined inductively by the following algorithm:

(i) $t := 0$, $v_0 := \lambda$, $i_0 := 1$, $\mathcal{R} := \mathcal{S}/\{1\}$

(ii) While $\mathcal{R} \neq \{\}$ do:

    $k_t := \min\{k : k \in \mathcal{R}\}$
    $j_t := \min\{j : j \in \delta_{k_t}(v_t)\}$
    $w_t := w_{i_t j_t}$
    $v_{t+1} := v_t w_t$
    $i_{t+1} := k_{i_t j_t}$
    $\mathcal{R} := \mathcal{R}/(\{j : \mathbb{P}_j(v_{t+1}) = 0\} \cup \{k_t\})$
    $t := t + 1$

(iii) $v_* := v_t$, $i_* := i_t$

At each iteration of the loop in step (ii) the set $\mathcal{R}$ loses at least 1 member, so the loop must terminate
after a finite number of steps. If $v_*$ and $i_*$ are the word and state in which it terminates, then clearly by
the construction $i_* \in \delta_1(v_*)$. In addition, since each state $j \neq 1$ must be removed from the set $\mathcal{R}$ before the
loop terminates, we know that for each state $j \neq 1$, either $\mathbb{P}_j(v_*) = 0$ or $i_* \in \delta_j(v_*)$. Thus, all states which
can generate $v_*$ may collapse to $i_*$ upon generating $v_*$.

The path-mergeable states condition implies aperiodicity of the HMM (that is, aperiodicity of the underlying
Markov chain). Combined with irreducibility this implies that there exist words $v'_k$, $k \in \mathcal{S}$, all of
some fixed length $L$, such that $k \in \delta_{i_*}(v'_k)$, for each $k$. So, for each state $k$ the word $u_k = v_* v'_k$ satisfies:

$$k \in \delta_j(u_k) , \quad \text{for all } j \text{ with } \mathbb{P}_j(u_k) > 0.$$

Thus, the power machine $M^n$ with $n = |u_k| = |v_*| + L$ is state-collapsible. Note that aperiodicity implies
the power machine is well defined for any $n \in \mathbb{N}$. □

Remark. The purpose of the above construction is simply to verify that for any edge-emitting HMM with
path-mergeable states some power of it will be state-collapsible. It is not to be taken as an optimal construction
or a method for determining the minimal power $n$. It is relatively easy to check (at least if the number of
states and symbols are both small) whether a given HMM is path-mergeable, and it is very easy to check that
a given HMM is state-collapsible. Thus, in practice, if one wants actual numerical bounds on the convergence
of the entropy rate estimates $h(n)$ or rate of memory loss for a given HMM $M$, it is probably best to first
verify that $M$ is path-mergeable, and then simply construct successive power machines $M^1 = M, M^2, M^3, \ldots$
until the first $n$ such that $M^n$ is state-collapsible. With the construction given above $n$ will always be at least
2, and quite often much larger than necessary. In the (not unusual) case that $M$ is itself state-collapsible,
the theorems of the previous section can be applied directly, and the estimates given below using the power
machine representation are not necessary.
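The practical procedure suggested in the remark can be sketched in a few lines. This is only an illustration under stated assumptions: state-collapsibility is read off from the end of the proof of Lemma 18 (for every state $k$ there is a symbol, here a length-$n$ word, such that every state able to emit it can move to $k$ while emitting it), path-mergeability is tested by a reachability search on pairs of states, and all function names are hypothetical.

```python
from itertools import product


def mat_mul(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def power_machine(T, n):
    """Return {w: Q^(w)} over length-n words w with nonzero matrix."""
    size = len(next(iter(T.values())))
    Q = {}
    for word in product(sorted(T), repeat=n):
        q = [[float(i == j) for j in range(size)] for i in range(size)]
        for x in word:
            q = mat_mul(q, T[x])
        if any(v > 0.0 for row in q for v in row):
            Q[word] = q
    return Q


def is_path_mergeable(T):
    """True if every pair of states can reach a common state on a common word."""
    size = len(next(iter(T.values())))

    def can_merge(i, j):
        seen, stack = {(i, j)}, [(i, j)]
        while stack:
            a, b = stack.pop()
            if a == b:
                return True
            for m in T.values():  # both states must emit the same symbol
                for a2, b2 in product(range(size), repeat=2):
                    if m[a][a2] > 0.0 and m[b][b2] > 0.0 and (a2, b2) not in seen:
                        seen.add((a2, b2))
                        stack.append((a2, b2))
        return False

    return all(can_merge(i, j) for i in range(size) for j in range(i + 1, size))


def is_state_collapsible(Q):
    """Assumed reading: each state k has a symbol collapsing all its emitters to k."""
    size = len(next(iter(Q.values())))

    def collapses_to(m, k):
        emitters = [j for j in range(size) if any(v > 0.0 for v in m[j])]
        return bool(emitters) and all(m[j][k] > 0.0 for j in emitters)

    return all(any(collapses_to(m, k) for m in Q.values()) for k in range(size))


def first_collapsible_power(T, max_n=6):
    """Smallest n <= max_n with M^n state-collapsible, or None."""
    if not is_path_mergeable(T):
        return None
    for n in range(1, max_n + 1):
        if is_state_collapsible(power_machine(T, n)):
            return n
    return None


# Two hypothetical examples: the first is not path-mergeable, the second is.
T_EVEN = {"a": [[0.5, 0.0], [0.0, 0.0]], "b": [[0.0, 0.5], [1.0, 0.0]]}
T_TOY = {"a": [[0.0, 0.5], [0.25, 0.25]], "b": [[0.0, 0.5], [0.25, 0.25]]}
```

The word enumeration grows exponentially in $n$, so a search of this kind is only realistic for small alphabets and small powers, in line with the caveat in the remark.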



4.2.2 Forgetting of the Initial Condition 

Throughout Section 4.2.2, $M = (\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ is an edge-emitting HMM with path-mergeable states, $M^n =
(\mathcal{S}, \mathcal{W}, \{Q^{(w)}\})$ is the corresponding state-collapsible power machine given by Lemma 18 ($n \geq 2$), and $i \in \mathcal{S}$
is a fixed state.
We take the joint state-output sequence $(S_t, X_t)_{t \geq 0}$ to be generated according to the measure $\mathbb{P}_i$ for $M$,
and define for a given output sequence the distributions:

$$\psi_m = \phi(x_0^{mn-1}) , \quad m \in \mathbb{N}$$

$$\psi_{i,m} = \phi_i(x_0^{mn-1}) , \quad m \in \mathbb{N}$$

Also, we define $\theta_m$, $\mu_m$, and $\nu_m$ to be the unique probability distributions on $\mathcal{S}$ satisfying the relations:

where $\epsilon_m = \|\psi_m - \psi_{i,m}\|_{TV}$.

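Lemma 19. Let $0 \leq \epsilon \leq 1$ be fixed, and let $\theta$ and $\nu$ be any two probability distributions on $\mathcal{S}$. Define the
distribution $\psi$ by $\psi = (1 - \epsilon)\theta + \epsilon\nu$. Let $w$ be a word of length $L \geq 1$ such that $\mathbb{P}_\psi(w) > 0$. Then:

$$\phi_\psi(w) = \phi_\theta(w) \left( \frac{(1-\epsilon)\mathbb{P}_\theta(w)}{(1-\epsilon)\mathbb{P}_\theta(w) + \epsilon\mathbb{P}_\nu(w)} \right) + \phi_\nu(w) \left( \frac{\epsilon\mathbb{P}_\nu(w)}{(1-\epsilon)\mathbb{P}_\theta(w) + \epsilon\mathbb{P}_\nu(w)} \right)$$

Proof. The proof is a straightforward calculation using Bayes' Theorem and the Law of Total Probability. □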
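The identity in Lemma 19 can be checked numerically on a small example. The sketch below assumes the edge-emitting HMM is given as a dictionary of symbol-labeled matrices and that $\phi_\psi(w)$ denotes the state distribution after emitting $w$ from initial distribution $\psi$; the helper names `push_forward` and `lemma19_rhs` and the example matrices are hypothetical.

```python
def push_forward(T, dist, word):
    """Return (P_dist(word), phi_dist(word)) for an initial state distribution."""
    size = len(dist)
    v = list(dist)
    for x in word:
        # Unnormalized forward update: v_j <- sum_i v_i * T^(x)_ij
        v = [sum(v[i] * T[x][i][j] for i in range(size)) for j in range(size)]
    prob = sum(v)
    posterior = [a / prob for a in v] if prob > 0.0 else None
    return prob, posterior


def lemma19_rhs(T, theta, nu, eps, word):
    """Likelihood-weighted mixture of the two posteriors (both assumed defined)."""
    p_theta, phi_theta = push_forward(T, theta, word)
    p_nu, phi_nu = push_forward(T, nu, word)
    denom = (1.0 - eps) * p_theta + eps * p_nu
    w_theta = (1.0 - eps) * p_theta / denom
    w_nu = eps * p_nu / denom
    return [w_theta * a + w_nu * b for a, b in zip(phi_theta, phi_nu)]


# Hypothetical example: two states, two symbols, rows sum to 1 across symbols.
T_DEMO = {
    "a": [[0.2, 0.3], [0.0, 0.6]],
    "b": [[0.5, 0.0], [0.1, 0.3]],
}
THETA, NU, EPS = [0.9, 0.1], [0.2, 0.8], 0.25
PSI = [(1.0 - EPS) * t + EPS * n for t, n in zip(THETA, NU)]
LHS = push_forward(T_DEMO, PSI, "ab")[1]
RHS = lemma19_rhs(T_DEMO, THETA, NU, EPS, "ab")
```

`LHS` and `RHS` agree up to rounding, reflecting that the unnormalized forward update is linear in the initial distribution.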

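Lemma 20. Let $t = (mn - 1) + \tau$, for some $m \in \mathbb{N}$ and $\tau \in \{1, \ldots, n\}$, and let $x_0^\infty$ be any infinite symbol
sequence generated from initial state $i$. Then:

$$\|\phi(x_0^t) - \phi_i(x_0^t)\|_{TV} \leq \max\left\{ \frac{\epsilon_m \mathbb{P}_{\mu_m}(x_{mn}^t)}{(1-\epsilon_m)\mathbb{P}_{\theta_m}(x_{mn}^t) + \epsilon_m \mathbb{P}_{\mu_m}(x_{mn}^t)} ,\ \frac{\epsilon_m \mathbb{P}_{\nu_m}(x_{mn}^t)}{(1-\epsilon_m)\mathbb{P}_{\theta_m}(x_{mn}^t) + \epsilon_m \mathbb{P}_{\nu_m}(x_{mn}^t)} \right\}$$

Proof. Applying Lemma 19 we have: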

0(4) = ^(4) 



.rmni I (l-e m )Pe m (a;^„)+e m P„ m (a;5, lI1 ) 

, { t \( i m f % (i'„) 

9v m \ X mn) I (l-e m )F flm (a4J+e m P„ m «, J , 



(6) 



and: 



cj>i{x ) = ^ im (iJ 

0Mm (^mn) (~ 



(l-£ m )F 9m (4„) 



.( a: mJ+ e '™ 11 Vm( a; mJ 

P (m (x^ n )+«Qp,, TO (x*„„) 



(7) 



The claim follows by applying the triangle inequality to (6) and (7), and using the fact that the total variation
norm between any two probability distributions is at most 1. □

Now take $0 < \alpha_3 < 1$ as in Theorem 1 for the state-collapsible power machine $M^n$, and let $\alpha_5$ and $\alpha_6$
be any real numbers such that $\alpha_3 < \alpha_5 < \alpha_6 < 1$. Let $\alpha_7 = \alpha_6^{1/n}$, and let $\alpha_8$ be any real number such that
$\alpha_7 < \alpha_8 < 1$. Define also the minimum non-zero transition probability:

$$\epsilon_* = \min\{T^{(x)}_{ij} : T^{(x)}_{ij} > 0\}$$

constants:

$$b_2 = \epsilon_*^n / 2$$

times:

$$t_1 = \min\{t \in \mathbb{N} : 1/m^2 \geq 2\alpha_5^m , \text{ for all } m \geq t\}$$

$$t_2 = \min\{t \in \mathbb{N} : (\alpha_5^m \cdot m^2)/b_2 \leq \alpha_6^m , \text{ for all } m \geq t\}$$

$$t_3 = \min\{t \in \mathbb{N} : (1/\alpha_7^n) \cdot \alpha_7^\tau \leq \alpha_8^\tau , \text{ for all } \tau \geq nt\}$$

random times:

$$T_A = \min\{t \in \mathbb{N} : \|\Psi_m - \Psi_{i,m}\|_{TV} \leq \alpha_5^m , \ \forall m \geq t\} \quad (T_A = \infty \text{ if no such } t)$$

$$T_B = \min\{t \in \mathbb{N} : (\Psi_{i,m})_{S_{mn}} \geq 1/m^2 , \ \forall m \geq t\} \quad (T_B = \infty \text{ if no such } t)$$

$$T_C = \max\{T_A, T_B, t_1, t_2, t_3\}$$

and events:

$$A = \{T_A < \infty\} , \quad B = \{T_B < \infty\} , \quad C = A \cap B$$

where $\Psi_m = \phi(X_0^{mn-1})$, $\Psi_{i,m} = \phi_i(X_0^{mn-1})$, and $(\Psi_{i,m})_{S_{mn}}$ is the component of the probability vector $\Psi_{i,m}$
corresponding to the state $S_{mn}$.

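Lemma 21. If $\mathbb{P}_j(w) > 0$ for some state $j$ and word $w$ of length $L \geq 1$, then $\mathbb{P}_j(w) \geq \epsilon_*^L$.

Proof. Denote $w = w_0 \ldots w_{L-1}$. The claim follows immediately from the decomposition:

$$\mathbb{P}_j(w) = \sum_{s_1^L} \mathbb{P}_j(X_0^{L-1} = w_0^{L-1}, S_1^L = s_1^L) = \sum_{s_1^L} \prod_{t=0}^{L-1} T^{(w_t)}_{s_t s_{t+1}}$$

where $s_0 = j$ in the product. If the sum is nonzero then some term must be nonzero, in which case that
term itself, and hence the sum, must be greater than or equal to $\epsilon_*^L$. □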

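Lemma 22. $\mathbb{P}_i(C) = 1$

Proof. We will show that $\mathbb{P}_i(A) = 1$ and $\mathbb{P}_i(B) = 1$. The result then follows directly from the definition
$C = A \cap B$.

• Claim (1) - $\mathbb{P}_i(A) = 1$.

By the relation between the power machine $M^n$ and base HMM $M$ we have:

$$\limsup_{m \to \infty} \|\Psi_{i,m} - \Psi_m\|_{TV}^{1/m} \leq \alpha_3 , \quad \mathbb{P}_i \text{ a.s.}$$

The claim follows from this and the fact that $\alpha_5 > \alpha_3$.

• Claim (2) - $\mathbb{P}_i(B) = 1$.

$$\begin{aligned}
\mathbb{P}_i\big((\Psi_{i,m})_{S_{mn}} < 1/m^2\big) &\leq \sum_{w} \mathbb{P}_i(X_0^{mn-1} = w) \cdot |\mathcal{S}| \cdot 1/m^2 \\
&= |\mathcal{S}|/m^2
\end{aligned}$$

Thus, by the Borel-Cantelli Lemma, $\mathbb{P}_i\big((\Psi_{i,m})_{S_{mn}} < 1/m^2 \text{ i.o.}\big) = 0$, and the claim follows.

□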
Lemma 23. If the joint sequence $(S_t, X_t)_{t \geq 0}$ is generated according to the measure $\mathbb{P}_i$, then on the event $C$:

$$\|\phi(X_0^t) - \phi_i(X_0^t)\|_{TV} \leq \alpha_8^t , \quad \text{for all } t \geq T_C \cdot n$$

Proof. Let $(x_0^\infty, s_0^\infty)$ be any $i$-possible realization of $(X_0^\infty, S_0^\infty)$ such that the event $C$ occurs. Here $i$-possible means that
$\mathbb{P}_i(X_0^t = x_0^t, S_0^t = s_0^t) > 0$, for all $t \in \mathbb{N}$. Let $t_A, t_B, t_C$ be the corresponding values of the random variables
$T_A, T_B, T_C$ for this realization $(x_0^\infty, s_0^\infty)$. For $t \geq t_C \cdot n$ we may decompose $t$ as $t = (mn - 1) + \tau$ for some
$m \geq t_C$ and $1 \leq \tau \leq n$. By the definition of $t_C$, we know that $m$ is greater than or equal to each of $t_1$, $t_2$,
$t_3$, $t_A$, and $t_B$. From this and some previous lemmas we obtain the following series of implications.



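(i) $m \geq t_A \implies \epsilon_m = \|\psi_m - \psi_{i,m}\|_{TV} \leq \alpha_5^m$

(ii) $m \geq t_1 \implies 1/m^2 \geq 2\alpha_5^m \implies \frac{1}{m^2} - \alpha_5^m \geq \frac{1}{2} \cdot \frac{1}{m^2}$

(iii) $m \geq t_B \implies (\psi_{i,m})_{s_{mn}} \geq 1/m^2$

(iv) (i) + (ii) + (iii) $\implies (\theta_m)_{s_{mn}} \geq \frac{1}{1-\epsilon_m} \left( \frac{1}{m^2} - \alpha_5^m \right) \geq \frac{1}{1-\epsilon_m} \cdot \frac{1}{2m^2}$

(v) Lemma 21 $\implies \mathbb{P}_{s_{mn}}(x_{mn}^t) \geq \epsilon_*^\tau \geq \epsilon_*^n$, since $x_{mn}^t$ is in fact generated from state $s_{mn}$

(vi) (i) + (iv) + (v) $\implies (1 - \epsilon_m) \cdot \mathbb{P}_{\theta_m}(x_{mn}^t) \geq (1 - \epsilon_m) \cdot (\theta_m)_{s_{mn}} \cdot \mathbb{P}_{s_{mn}}(x_{mn}^t) \geq \frac{1}{2m^2} \cdot \epsilon_*^n = \frac{b_2}{m^2}$

(vii) (i) $\implies \epsilon_m \mathbb{P}_{\mu_m}(x_{mn}^t) \leq \epsilon_m \leq \alpha_5^m$ and $\epsilon_m \mathbb{P}_{\nu_m}(x_{mn}^t) \leq \epsilon_m \leq \alpha_5^m$

(viii) (vi) and (vii) together with the fact that $m \geq t_2$ imply that:

$$\frac{\epsilon_m \mathbb{P}_{\mu_m}(x_{mn}^t)}{(1 - \epsilon_m)\mathbb{P}_{\theta_m}(x_{mn}^t) + \epsilon_m \mathbb{P}_{\mu_m}(x_{mn}^t)} \leq \frac{\alpha_5^m}{b_2/m^2 + \alpha_5^m} \leq \frac{\alpha_5^m \cdot m^2}{b_2} \leq \alpha_6^m$$

and:

$$\frac{\epsilon_m \mathbb{P}_{\nu_m}(x_{mn}^t)}{(1 - \epsilon_m)\mathbb{P}_{\theta_m}(x_{mn}^t) + \epsilon_m \mathbb{P}_{\nu_m}(x_{mn}^t)} \leq \frac{\alpha_5^m}{b_2/m^2 + \alpha_5^m} \leq \frac{\alpha_5^m \cdot m^2}{b_2} \leq \alpha_6^m$$

(ix) Finally, (viii), Lemma 20 and the fact that $m \geq t_3$ together imply:

$$\begin{aligned}
\|\phi(x_0^t) - \phi_i(x_0^t)\|_{TV} &\leq \max\left\{ \frac{\epsilon_m \mathbb{P}_{\mu_m}(x_{mn}^t)}{(1 - \epsilon_m)\mathbb{P}_{\theta_m}(x_{mn}^t) + \epsilon_m \mathbb{P}_{\mu_m}(x_{mn}^t)} ,\ \frac{\epsilon_m \mathbb{P}_{\nu_m}(x_{mn}^t)}{(1 - \epsilon_m)\mathbb{P}_{\theta_m}(x_{mn}^t) + \epsilon_m \mathbb{P}_{\nu_m}(x_{mn}^t)} \right\} \\
&\leq \alpha_6^m = \alpha_7^{mn} \leq (1/\alpha_7^n) \cdot \alpha_7^t \leq \alpha_8^t
\end{aligned}$$

Since this holds for any $i$-possible realization $(s_0^\infty, x_0^\infty)$ such that the event $C$ occurs the claim is proved.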

□ 



Theorem 3. Any edge-emitting HMM with path-mergeable states forgets its initial condition a.s. at exponential
rate $\alpha_3^{1/n}$ or faster (where $\alpha_3$, as before, is the memory loss rate for the corresponding state-collapsible
power machine $M^n$ of Lemma 18). That is, for each state $i$:

$$\limsup_{t \to \infty} \|\phi(X_0^{t-1}) - \phi_i(X_0^{t-1})\|_{TV}^{1/t} \leq \alpha_3^{1/n} , \quad \mathbb{P}_i \text{ a.s.}$$

Proof. This follows directly from Lemmas 22 and 23 and the fact that the constant $\alpha_8$ can be made arbitrarily
close to $\alpha_3^{1/n}$. □
Remark. By construction the event $A$ must have probability 1 with respect to the measure $\mathbb{P}_i$, and on the
event $A$ we have $\|\phi(X_0^{mn-1}) - \phi_i(X_0^{mn-1})\|_{TV} \leq \alpha_5^m \leq \alpha_8^{mn}$, for all sufficiently large $m$. Thus, it is easy
to see that a.s. forgetting of the initial condition occurs at an exponential rate along an $n$-block subsequence.
From this it seems clear that a.s. forgetting should occur on the entire sequence exponentially fast. That is,
we should be able to "fill in the gaps". And indeed we can, at least almost surely. However, it is possible to
construct HMMs such that there exist a symbol $x$ and state distributions $\psi$ and $\psi'$ with $\|\psi - \psi'\|_{TV}$ arbitrarily
close to 0, $\mathbb{P}_\psi(x) > 0$, $\mathbb{P}_{\psi'}(x) > 0$, and $\|\phi_\psi(x) - \phi_{\psi'}(x)\|_{TV}$ arbitrarily close to 1. Thus, some care and fairly
technical arguments as given above are necessary to fill in the gaps. However, conceptually the basic idea is
quite simple. Though it is possible to have such distributions $\psi, \psi'$ and symbol $x$ (or word $w$, with $|w| \leq n$),
the probability of generating $x$ (or $w$) from either $\psi$ or $\psi'$ is also arbitrarily small. In fact, it is very unlikely
(from either $\psi$ or $\psi'$) even to generate a symbol $x$ (or word $w$, $|w| \leq n$) which will separate $\psi$ and $\psi'$ by very
much, if they are already close. So, we can show by Borel-Cantelli that almost surely the event of generating
such unusual symbols $x$ (or words $w$) that will separate the state distributions $\phi(X_0^{mn-1})$ and $\phi_i(X_0^{mn-1})$ by
too much does not occur infinitely often.



4.2.3 Convergence of Entropy Rate Approximations 

Lemma 24. Let $M = (\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ be a HMM with power machine $M^n = (\mathcal{S}, \mathcal{W}, \{Q^{(w)}\})$, for some $n \geq 2$
relatively prime to $\mathrm{per}(M)$. Let $h$ and $h(t)$ be the entropy rate and order-$t$ approximation for the output
process of $M$, and let $g$ and $g(t)$ be the entropy rate and order-$t$ approximation for the output process of $M^n$.
If

$$\limsup_{t \to \infty} \{g(t) - g\}^{1/t} \leq \alpha ,$$

for some $0 < \alpha < 1$, then

$$\limsup_{t \to \infty} \{h(t) - h\}^{1/t} \leq \alpha^{1/n}$$

Proof. By definition:

$$g = \lim_{t \to \infty} \frac{H(X_1^{nt})}{t} = \lim_{t \to \infty} n \cdot \frac{H(X_1^{nt})}{nt} = n \cdot h , \text{ and}$$

$$g(t) = H(X_{n(t-1)+1}^{nt} \mid X_1^{n(t-1)}) = \sum_{\tau = n(t-1)}^{nt-1} H(X_{\tau+1} \mid X_1^\tau) = \sum_{\tau = n(t-1)}^{nt-1} h(\tau + 1).$$

Combining these relations gives:

$$g(t) - g = \sum_{\tau = n(t-1)}^{nt-1} (h(\tau + 1) - h) \geq n \cdot (h(nt) - h)$$

Thus, for any $\tau = nt$ (with $t \in \mathbb{N}$) we have:

$$h(\tau) - h \leq \frac{1}{n} (g(\tau/n) - g)$$

The result follows directly from this inequality and the fact that $h(\tau)$ is monotonically decreasing. □
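The relations between $h(t)$ and $g(t)$ used in this proof can be observed on a small example by computing block entropies exactly. The sketch below is illustrative only: it assumes an edge-emitting HMM stored as symbol-labeled matrices, approximates the stationary distribution by power iteration (which presumes an irreducible, aperiodic chain), and enumerates all words up to a short length. The names `stationary`, `block_entropies`, `h`, `g` and the example matrices are hypothetical.

```python
import math


def stationary(T, iterations=500):
    """Approximate the stationary distribution by power iteration."""
    size = len(next(iter(T.values())))
    P = [[sum(m[i][j] for m in T.values()) for j in range(size)] for i in range(size)]
    pi = [1.0 / size] * size
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(size)) for j in range(size)]
    return pi


def block_entropies(T, max_len):
    """Return [H(X_1^L) for L = 0..max_len] in bits under the stationary measure."""
    size = len(next(iter(T.values())))
    level = [stationary(T)]  # one unnormalized forward vector per word
    H = [0.0]
    for _ in range(max_len):
        level = [[sum(v[i] * m[i][j] for i in range(size)) for j in range(size)]
                 for v in level for m in T.values()]
        level = [v for v in level if sum(v) > 0.0]  # drop impossible words
        H.append(-sum(sum(v) * math.log2(sum(v)) for v in level))
    return H


def h(H, t):
    """Order-t estimate h(t) = H(X_t | X_1^{t-1}) = H(X_1^t) - H(X_1^{t-1})."""
    return H[t] - H[t - 1]


def g(H, n, t):
    """Order-t estimate for the n-block process, from the same block entropies."""
    return H[n * t] - H[n * (t - 1)]


# Hypothetical two-state, two-symbol example.
T_DEMO = {
    "a": [[0.2, 0.3], [0.0, 0.6]],
    "b": [[0.5, 0.0], [0.1, 0.3]],
}
H_DEMO = block_entropies(T_DEMO, 8)
```

With `n = 2`, the quantity `g(H_DEMO, 2, t)` equals `h(H_DEMO, 2*t - 1) + h(H_DEMO, 2*t)` and is at least `2 * h(H_DEMO, 2*t)`, which is the inequality the proof relies on. The enumeration is exponential in the word length, so this is only a check for short blocks.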

Theorem 4. Let $M = (\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ be an edge-emitting HMM with path-mergeable states, and let $h$ and
$h(t)$ be the entropy rate and order-$t$ approximation for its output process. Then:

$$\limsup_{t \to \infty} \{h(t) - h\}^{1/t} \leq \alpha_4^{1/n}$$

where $0 < \alpha_4 < 1$ is the decay rate (as given in Theorem 2) of the entropy rate approximations for the
state-collapsible power machine $M^n$ (of Lemma 18).

Proof. This follows directly from Lemma 24. □
5 Relation to State-Emitting HMMs



In Section 4 we established exponential convergence of the entropy rate approximations $h(t)$ and an a.s.
exponential rate of memory loss for edge-emitting HMMs with path-mergeable states. We now show that
these results for edge-emitting HMMs translate directly to analogous results for state-emitting HMMs.

The proofs rely primarily on the state-emitting to edge-emitting conversion algorithm given in Section
2.2.2. For reference, we will denote this conversion algorithm by $\zeta$. We will also generally denote
state-emitting HMMs by $M_s$ and edge-emitting HMMs by $M_e$. We will use $\mathbb{P}^s$ to denote the stationary probability
measure on the joint state-symbol sequence $(S_t, X_t)$ for a state-emitting HMM $M_s$, and $\mathbb{P}^e$ to denote the
stationary probability measure over $(S_t, X_t)$ for an edge-emitting HMM $M_e$. $\mathbb{P}_i^s$ and $\mathbb{P}_i^e$ are the conditional
measures for $M_s$ and $M_e$ given that the initial state $S_0$ is state $i$. The following simple fact will be quite
useful. The proof is immediate from the nature of the conversion algorithm $\zeta$.

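Lemma 25. If $M_s = (\mathcal{S}, \mathcal{X}, T, O)$ is a state-emitting HMM and $M_e = \zeta(M_s)$, then for any $s_1^n \in \mathcal{S}^n$,
$x_1^n \in \mathcal{X}^n$ and state $i$:

$$\mathbb{P}_i^s(S_1^n = s_1^n, X_1^n = x_1^n) = \mathbb{P}_i^e(S_1^n = s_1^n, X_0^{n-1} = x_1^n)$$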

From this lemma it follows that the conversion algorithm $\zeta$ preserves path-mergeability, that the
distribution over future output from any state $i$ is the same for a state-emitting HMM $M_s$ and the corresponding
edge-emitting HMM $M_e = \zeta(M_s)$, and also that the conditional state distributions $\phi_i^s(w)$ and $\phi_i^e(w)$ are
equivalent for $M_s$ and $M_e$. More precisely we have:

Lemma 26. Let $M_s$ be a state-emitting HMM, and let $M_e = \zeta(M_s)$.

1. For any state $i$ and symbol sequence $x_1^n$, $\mathbb{P}_i^s(X_1^n = x_1^n) = \mathbb{P}_i^e(X_0^{n-1} = x_1^n)$.

2. For any state $i$ and symbol sequence $x_1^n$ with $\mathbb{P}_i^s(X_1^n = x_1^n) > 0$, $\phi_i^s(x_1^n) = \phi_i^e(x_1^n)$.

3. If $M_s$ is path-mergeable then $M_e$ is also path-mergeable.

Using this lemma we will now show that the edge-emitting results of Section 4 translate directly to
state-emitting HMMs.

Theorem 5. If $M_s$ is a path-mergeable, state-emitting HMM, then $h(n) \searrow h$ exponentially fast for its
stationary output process $\mathcal{P}^s = (X_t)$.

Proof. Let $M_e = \zeta(M_s)$. By Lemma 26, $M_e$ is also path-mergeable. Hence, by Theorem 4, the entropy rate
estimates converge exponentially for its stationary output process $\mathcal{P}^e$. Since $\mathcal{P}^s$ and $\mathcal{P}^e$ are the same process
(distributionally) the conclusion follows. □

Theorem 6. Any path-mergeable, state-emitting HMM $M_s$ a.s. forgets its initial condition at an exponential
rate.

Proof. Let $M_e = \zeta(M_s)$. By part 3 of Lemma 26, $M_e$ is also path-mergeable. Hence, by Theorem 3, $M_e$ a.s.
forgets its initial condition at an exponential rate. The conclusion follows immediately from this and parts 1 and
2 of Lemma 26. □

6 Discussion 

The convergence speed of the entropy rate estimates h(n) and rate of memory loss in the initial condition 
are two important (and related) questions in the theory of HMMs. We have established here exponential 
bounds on both these quantities for finite HMMs with path-mergeable states. Below we discuss relations to 
work on related problems by others and also some simple extensions. 
6.1 Related Work



The earliest major result on convergence of the entropy rate estimates $h(n)$, and our primary inspiration,
is reference [14], which uses a coupling argument to prove an exponential bound on the rate of convergence
for finite, functional HMMs with strictly positive state transition probabilities. Little else has been done
directly on this problem until quite recently, though related results which easily imply an exponential rate
of convergence for the estimates $h(n)$ were also given earlier in [26] using a similar coupling argument, and
shortly thereafter in [27] with an intuitively similar, but more direct, approach. The terms "loss of memory"
or "forgetting the initial condition" were not used in this early literature [26, 27], but mathematically the
estimates were of this type. However, the results given in these works were also only for (finite) state-emitting
HMMs with strictly positive transition probabilities: $T_{ij} > 0$, for all $i, j$.

Of course, if a finite Markov kernel $T$ is aperiodic then some power of it $T^n$ will be strictly positive. So,
one may be tempted to think these early methods could easily be extended to aperiodic HMMs. But this is
not as easy as it seems.

The difficulty arises with the HMM representation. For a finite, aperiodic HMM there always exists some 
length n such that it is possible to transition from each state i to each other state j in n steps, but it may 
not be possible to do so while emitting the same output sequence. And it is this fact that makes coupling 
arguments much more difficult. 

Naturally, strict positivity of the observation matrix $O$ will rectify this problem. In this case, that of an aperiodic,
state-emitting HMM with strictly positive observation matrix, a direct coupling argument for the block
process can be applied to give exponential bounds as well.

We have established here, however, exponential convergence bounds under the weaker condition of
path-mergeability, which does not require strict positivity of either $O$ or $T$. Indeed, it is easy to see that for a
finite state-emitting HMM path-mergeability is a strictly weaker condition than either (a) positivity of $T$
or (b) positivity of $O$ + aperiodicity. In fact, in the case of specifically finite HMMs path-mergeability is
the weakest condition we are aware of for any results on loss of memory or convergence of the entropy rate
estimates.

However, over the last few decades there has also been substantial work done on the rate of memory
loss for HMMs in a variety of other (and often more general) settings [17-24]. This work again focuses primarily on
state-emitting models, but extends to $\mathbb{R}^n$-valued (or more general) outputs, and often beyond a finite
internal state set as well, and sometimes even to continuous time as in [19, 22] and parts of [20]. The
literature is too vast to summarize accurately in full, but good estimates of the rate of memory loss have
been established in many instances with a variety of different methods: e.g., Lyapunov exponent theory and
properties of the Birkhoff contraction coefficient and Hilbert projective metric.

Because of various differences in the models, it is difficult to compare many of these more recent results 
on memory loss to our own directly. However, we do note that much of this more recent work has also relied 
on strong positivity assumptions. For example, in [17] and parts of [18, 20] it is assumed that the Markov
kernel T is strictly positive (i.e., the conditional measure from each state has a strictly positive density 
with respect to some fixed reference measure) and in [I7|[T5J[20, 21 , 24 it is assumed that the observation 
kernel O is strictly positive (in the same sense). In the case that the observation kernel (but not the Markov 
transition kernel) is taken to be strictly positive and the state set is finite [HIET], it is also assumed that 
the Markov chain is aperiodic. Restricted, at least, to the case of finite HMMs, these are stronger assumptions
than path-mergeability.

We should mention also, though, reference [23], where an exponential bound on the a.s. rate of memory 
loss is established without assuming strict positivity of either T or O. The authors were studying, generally, 
state-emitting HMMs with output space $\mathcal{X} = \mathbb{R}^n$ and internal state space $\mathcal{S} \subseteq \mathbb{R}^m$, where the kernels and
invariant state distribution have densities with respect to some arbitrary reference measures. But, their 
framework covers the case of finite HMMs and, translated to the finite case, their assumption amounts to 
the following: 

There exists $i \in \mathcal{S}$ such that $T_{ij} > 0$, for all $j$. (8)

This mixing condition (8) is neither equivalent to, strictly stronger than, nor strictly weaker than path-mergeability
for a finite, state-emitting HMM. Indeed, there exist simple examples of path-mergeable HMMs which do
not satisfy this condition, but also HMMs which do satisfy this condition but are not path-mergeable. It
can be shown, however, by an iterative construction similar to that used in the proof of Lemma 18, that if
$M_s$ is a finite, state-emitting HMM satisfying condition (8), and $M_e = \zeta(M_s)$, then $M_e^n$ is state-collapsible
for some $n \in \mathbb{N}$. From this one may obtain (as shown in Sections 4 and 5) exponential bounds on both the
rate of memory loss and convergence rate of the estimates $h(n)$ for the HMM $M_s$.

6.2 Extensions 

Path-mergeability is a sufficient condition for both exponential convergence of the entropy rate estimates and
an exponential rate of memory loss. However, as discussed above, it is not a necessary one. Indeed, it seems
likely that the entropy estimates $h(n)$ converge exponentially for any finite HMM. Proving this in general
may be difficult, but we give below one simple extension of path-mergeability to a more general condition
where exponential convergence does hold.

Let us say two states $i, j$ of a HMM are incompatible if there exists some length $L$ such that the sets of
length-$L$ words which can be generated from state $i$ and from state $j$ have no overlap. More precisely, if for each
$w$ with $|w| = L$ we have either:

$$\mathbb{P}_i(X_0^{|w|-1} = w) = 0 \text{ or } \mathbb{P}_j(X_0^{|w|-1} = w) = 0 \text{ (or both), for an edge-emitting HMM}$$

$$\mathbb{P}_i(X_1^{|w|} = w) = 0 \text{ or } \mathbb{P}_j(X_1^{|w|} = w) = 0 \text{ (or both), for a state-emitting HMM}$$

Consider the following condition: 

Each pair of states i,j is either path-mergeable or incompatible. (9) 

This condition (9) is clearly weaker than path-mergeability, but if an edge-emitting HMM satisfies this condition
then it can be shown, by small modifications of the construction given in Lemma 18 for path-mergeable
HMMs, that some power of it is also state-collapsible. From this one may obtain exponential bounds on
convergence of the entropy rate estimates $h(n)$ and rate of memory loss, as in Section 4. Analogous results
hold for state-emitting HMMs satisfying (9) as well, since the standard conversion algorithm $\zeta$ preserves
both path-mergeability and incompatibility for each given pair of states $i, j$.

In fact, one can also use a weaker definition of incompatibility for state-emitting HMMs where the symbol
$X_0$ is included as part of $w$ (i.e., apply the edge-emitting definition for state-emitting HMMs), and still have
exponential convergence of the entropy rate estimate $h(n)$ whenever (9) is satisfied. This does not follow
immediately from the standard state-emitting to edge-emitting conversion $\zeta$, but if one runs the conversion
in the other direction instead:

$$(\mathcal{S}, \mathcal{X}, T, O) \to (\mathcal{S}, \mathcal{X}, \{T'^{(x)}\}) \quad \text{where } T'^{(x)}_{ij} = T_{ij} O_{ix} ,$$

instead of $T^{(x)}_{ij} = T_{ij} O_{jx}$, then the property (9) is preserved under this conversion with the alternative
state-emitting version of incompatibility including $X_0$. And, from this one may obtain the desired result.
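The two conversions differ only in which endpoint of a transition supplies the emission probability. The following sketch, with a hypothetical function name `to_edge_emitting` and a hypothetical example, builds the symbol-labeled matrices for both choices from a transition matrix $T$ and an observation matrix $O$ stored as lists of rows.

```python
def to_edge_emitting(T, O, symbols, emit_from="target"):
    """Convert a state-emitting HMM (T, O) to symbol-labeled matrices.

    emit_from="target": T^(x)_ij = T_ij * O_jx (the standard conversion).
    emit_from="source": T'^(x)_ij = T_ij * O_ix (the alternative conversion).
    """
    if emit_from not in ("target", "source"):
        raise ValueError("emit_from must be 'target' or 'source'")
    size = len(T)
    matrices = {}
    for x_index, x in enumerate(symbols):
        matrices[x] = [
            [T[i][j] * (O[j][x_index] if emit_from == "target" else O[i][x_index])
             for j in range(size)]
            for i in range(size)
        ]
    return matrices


# Hypothetical example: state 0 always emits 'a', state 1 emits 'a' or 'b'.
T_STATE = [[0.5, 0.5], [1.0, 0.0]]
O_STATE = [[1.0, 0.0], [0.25, 0.75]]
STANDARD = to_edge_emitting(T_STATE, O_STATE, "ab", emit_from="target")
ALTERNATIVE = to_edge_emitting(T_STATE, O_STATE, "ab", emit_from="source")
```

In both cases the matrices sum over symbols to $T$, so the underlying Markov chain is unchanged; only the alignment between states and emitted symbols shifts by one step.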

If, however, there exist state pairs i,j that are not path-mergeable or incompatible (in any sense), then 
the situation becomes significantly more complicated. Coupling arguments, at least, are quite difficult in 
this case. 